An Introduction to Molecular Biology/Function and structure of Proteins

Proteins were first described by the Dutch chemist Gerhardus Johannes Mulder and named by the Swedish chemist Jöns Jakob Berzelius in 1838. Early nutritional scientists such as the German Carl von Voit believed that protein was the most important nutrient for maintaining the structure of the body, because it was generally believed that "flesh makes flesh."

The amino acids in a polypeptide chain are linked by peptide bonds. Once linked in the protein chain, an individual amino acid is called a residue, and the linked series of carbon, nitrogen, and oxygen atoms are known as the main chain or protein backbone. The peptide bond has two resonance forms that contribute some double-bond character and inhibit rotation around its axis, so that the alpha carbons are roughly coplanar. The other two dihedral angles in the peptide bond determine the local shape assumed by the protein backbone. The end of the protein with a free carboxyl group is known as the C-terminus or carboxy terminus, whereas the end with a free amino group is known as the N-terminus or amino terminus.

The words protein, polypeptide, and peptide are somewhat ambiguous and can overlap in meaning. Protein is generally used to refer to the complete biological molecule in a stable conformation, whereas peptide is generally reserved for short amino acid oligomers that often lack a stable three-dimensional structure. However, the boundary between the two is not well defined and usually lies near 20–30 residues. Polypeptide can refer to any single linear chain of amino acids, usually regardless of length, but often implies an absence of a defined conformation.



Amino acids


There are 22 standard amino acids, but only 21 are found in eukaryotes. Of the 22, 20 are directly encoded by the universal genetic code. Humans can synthesize 11 of these 20 from each other or from other molecules of intermediary metabolism. The other 9 must be consumed in the diet, and so are called essential amino acids; those are histidine, isoleucine, leucine, lysine, methionine, phenylalanine, threonine, tryptophan, and valine. The remaining two, selenocysteine and pyrrolysine, are incorporated into proteins by unique synthetic mechanisms.

Each α-amino acid consists of a backbone part that is present in all the amino acid types, and a side chain that is unique to each type of residue. An exception to this rule is proline, where the hydrogen atom is replaced by a bond to the side chain. Because the α carbon atom is bound to four different groups it is chiral; however, only one of the isomers occurs in biological proteins. Glycine, however, is not chiral, since its side chain is a hydrogen atom. A simple mnemonic for the correct L-form is "CORN": when the Cα atom is viewed with the H in front, the groups read "CO-R-N" in a clockwise direction.

Isomerism

Of the standard α-amino acids, all but glycine can exist as either of two optical isomers, called L- and D-amino acids, which are mirror images of each other. While L-amino acids represent all of the amino acids incorporated into proteins during translation in the ribosome, D-amino acids are found in some proteins produced by enzymatic post-translational modification after translation and translocation to the endoplasmic reticulum, as in exotic sea-dwelling organisms such as cone snails. They are also abundant components of the peptidoglycan cell walls of bacteria, and D-serine may act as a neurotransmitter in the brain. The L and D convention for amino acid configuration refers not to the optical activity of the amino acid itself, but rather to the optical activity of the isomer of glyceraldehyde from which that amino acid can theoretically be synthesized (D-glyceraldehyde is dextrorotatory; L-glyceraldehyde is levorotatory). Alternatively, the (S) and (R) designators are used to indicate the absolute stereochemistry. Almost all of the amino acids in proteins are (S) at the α carbon, with cysteine being (R) and glycine non-chiral. Cysteine is unusual because it has a sulfur atom at the second position in its side chain. Sulfur has a higher atomic number than the atoms attached to the first side-chain carbon in the other standard amino acids, so the side chain of cysteine takes a higher priority, which gives (R) instead of (S).

Zwitterions

The amine and carboxylic acid functional groups found in amino acids give them amphiprotic properties. At a certain pH, known as the isoelectric point, an amino acid has no overall charge, since the number of protonated ammonium groups (positive charges) and deprotonated carboxylate groups (negative charges) is equal. The amino acids all have different isoelectric points. The ion produced at the isoelectric point has both positive and negative charges and is known as a zwitterion, from the German word Zwitter meaning "hermaphrodite" or "hybrid". Amino acids can exist as zwitterions in solids and in polar solutions such as water, but not in the gas phase. Zwitterions have minimal solubility at their isoelectric point, so an amino acid can be isolated by precipitating it from water by adjusting the pH to its particular isoelectric point.
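The idea of an isoelectric point can be made concrete with a small calculation. The sketch below assumes an amino acid whose side chain does not ionize, so that only the α-carboxyl and α-amino groups carry charge, and it uses approximate textbook pKa values for glycine that are not taken from this chapter. The function names are illustrative only.

```python
def net_charge(ph, pka_carboxyl, pka_amino):
    """Average net charge of an amino acid with a non-ionizable side chain."""
    # Fraction of amino groups that carry a proton (+1 each).
    positive = 1.0 / (1.0 + 10 ** (ph - pka_amino))
    # Fraction of carboxyl groups that have lost a proton (-1 each).
    negative = 1.0 / (1.0 + 10 ** (pka_carboxyl - ph))
    return positive - negative


def isoelectric_point(pka_carboxyl, pka_amino, tol=1e-6):
    """Find the pH of zero net charge by bisection between the two pKa values."""
    low, high = pka_carboxyl, pka_amino
    while high - low > tol:
        mid = (low + high) / 2.0
        if net_charge(mid, pka_carboxyl, pka_amino) > 0:
            low = mid    # still net positive: the pI lies at a higher pH
        else:
            high = mid   # net negative (or zero): the pI lies at a lower pH
    return (low + high) / 2.0


# Approximate textbook pKa values for glycine (an assumption for illustration).
GLYCINE_PKA_CARBOXYL = 2.34
GLYCINE_PKA_AMINO = 9.60

pi_glycine = isoelectric_point(GLYCINE_PKA_CARBOXYL, GLYCINE_PKA_AMINO)
print(round(pi_glycine, 2))  # 5.97
```

For this simple two-group case the result is just the mean of the two pKa values; the bisection is used only because the same approach would still work if further ionizable groups were added to `net_charge`.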

The 20 naturally occurring amino acids have different physical and chemical properties, including their electrostatic charge, pKa, hydrophobicity, size and specific functional groups. These properties play a major role in molding protein structure. The salient features of amino acids are described below in the table.

Classification of amino acids
The 20 amino acids encoded directly by the genetic code can be divided into several groups based on their properties. Important factors are charge, hydrophilicity or hydrophobicity, size and functional groups. Amino acids are usually classified by the properties of their side chain into four groups. The side chain can make an amino acid a weak acid or a weak base, and a hydrophile if the side chain is polar or a hydrophobe if it is nonpolar.

Protein amino acids are combined into a single polypeptide chain in a condensation reaction. This reaction is catalysed by the ribosome in a process known as translation.

Polar and nonpolar amino acids and their one- and three-letter codes

In addition, two further amino acids, selenocysteine and pyrrolysine, are incorporated by overriding stop codons.

In addition to the specific amino acid codes, placeholders are used in cases where chemical or crystallographic analysis of a peptide or protein cannot conclusively determine the identity of a residue.

Unk is sometimes used instead of Xaa, but is less standard.

Many non-standard amino acids also have a specific code. For example, several peptide drugs, such as Bortezomib or MG132, are artificially synthesized and retain their protecting groups, which have specific codes. Bortezomib is Pyz-Phe-boroLeu and MG132 is Z-Leu-Leu-Leu-al. To aid in the analysis of protein structure, photocrosslinking amino acid analogues are also available. These include photoleucine (pLeu) and photomethionine (pMet).
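The standard three-letter and one-letter codes discussed in this section can be kept in a small lookup table. The following is a minimal sketch: the codes themselves are the conventional ones, while the dictionary name, the function name and the hyphen-separated input format are assumptions made for illustration.

```python
# Conventional three-letter to one-letter amino acid codes.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
    # The two amino acids incorporated by overriding stop codons.
    "Sec": "U", "Pyl": "O",
    # Placeholders for a residue of unknown identity.
    "Xaa": "X", "Unk": "X",
}


def to_one_letter(three_letter_sequence, separator="-"):
    """Convert e.g. 'Met-Ala-Leu' to 'MAL'; raises KeyError on unknown codes."""
    codes = three_letter_sequence.split(separator)
    return "".join(THREE_TO_ONE[code.capitalize()] for code in codes)


print(to_one_letter("Met-Ala-Leu-Glu-Lys"))  # MALEK
```

Codes for synthetic or protected residues such as boroLeu are deliberately left out, so the function raises a `KeyError` for them instead of guessing.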

Peptide bond


A peptide bond (amide bond) is a covalent chemical bond formed between two molecules when the carboxyl group of one molecule reacts with the amino group of the other molecule, thereby releasing a molecule of water (H2O). This is a dehydration synthesis reaction (also known as a condensation reaction), and usually occurs between amino acids. The resulting C(O)NH bond is called a peptide bond, and the resulting molecule is an amide. The four-atom functional group -C(=O)NH- is called a peptide link. Polypeptides and proteins are chains of amino acids held together by peptide bonds, as is the backbone of PNA.
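Because each peptide bond is formed with the release of one water molecule, the molar mass of a linear peptide is the sum of the masses of its free amino acids minus one water per bond. The sketch below illustrates this bookkeeping; the molar masses are approximate average values supplied here for illustration and the names are hypothetical.

```python
WATER_MASS = 18.015  # g/mol, approximate

# Approximate average molar masses of the free amino acids, in g/mol.
FREE_AMINO_ACID_MASS = {"Gly": 75.067, "Ala": 89.094}


def peptide_mass(residues):
    """Molar mass of a linear peptide built from the given amino acids."""
    if not residues:
        return 0.0
    total = sum(FREE_AMINO_ACID_MASS[name] for name in residues)
    peptide_bonds = len(residues) - 1  # n amino acids are joined by n - 1 bonds
    return total - peptide_bonds * WATER_MASS


print(round(peptide_mass(["Gly", "Ala"]), 3))         # 146.146
print(round(peptide_mass(["Gly", "Gly", "Gly"]), 3))  # 189.171
```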

A peptide bond can be broken by amide hydrolysis (the addition of water). The peptide bonds in proteins are metastable, meaning that in the presence of water they will break spontaneously, releasing 2-4 kcal/mol of free energy, but this process is extremely slow. In living organisms, the process is facilitated by enzymes. Living organisms also employ enzymes to form peptide bonds; this process requires free energy. A peptide bond absorbs light at wavelengths of 190-230 nm.

The peptide bond tends to be planar due to the delocalization of the electrons from the double bond. The rigid peptide dihedral angle, ω (rotation about the bond between C1, the carbonyl carbon, and N), is almost always close to 180 degrees. The dihedral angles phi φ (rotation about the bond between N and Cα) and psi ψ (rotation about the bond between Cα and C1) can take a certain range of possible values. These angles are the internal degrees of freedom of a protein and control the protein's conformation. They are restrained by geometry to allowed ranges typical for particular secondary structure elements, and are represented in a Ramachandran plot. A few important bond lengths are given in the table below.
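A dihedral angle such as φ, ψ or ω is defined by four consecutive atoms and can be computed from their coordinates. The sketch below uses a standard atan2-based formula and only the Python standard library; the coordinates in the example are made-up points chosen to give simple angles, not atoms from a real protein, and the function names are illustrative.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral_degrees(p0, p1, p2, p3):
    """Signed dihedral angle (degrees) about the p1-p2 bond."""
    b0 = _sub(p1, p0)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    n1 = _cross(b0, b1)  # normal to the plane through p0, p1, p2
    n2 = _cross(b1, b2)  # normal to the plane through p1, p2, p3
    x = _dot(n1, n2)
    y = math.sqrt(_dot(b1, b1)) * _dot(b0, n2)
    return math.degrees(math.atan2(y, x))


# Toy coordinates: the central bond runs along the z axis.
a, b, c = (1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 1.0)
print(dihedral_degrees(a, b, c, (1.0, 0.0, 1.0)))        # cis: 0.0
print(dihedral_degrees(a, b, c, (0.0, 1.0, 1.0)))        # 90.0
print(abs(dihedral_degrees(a, b, c, (-1.0, 0.0, 1.0))))  # trans: 180.0
```

Applied to the backbone atoms of successive residues, the same function would give the (φ, ψ) pairs that are plotted in a Ramachandran plot.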

β-peptides
In α amino acids, both the carboxylic acid group and the amino group are bonded to the same carbon center, termed the α carbon ($$\mathrm{C}^{\alpha}$$) because it is one atom away from the carboxylate group. In β amino acids, the amino group is bonded to the β carbon ($$\mathrm{C}^{\beta}$$), a carbon that is present in all of the 20 standard amino acids except glycine. Because glycine lacks a β carbon, β-glycine is not possible.

The chemical synthesis of β amino acids can be challenging, especially given the diversity of functional groups bonded to the β carbon and the necessity of maintaining chirality. In alanine, the β carbon is achiral; however, most larger amino acids have a chiral $$\mathrm{C}^{\beta}$$ atom. A number of synthesis mechanisms have been introduced to efficiently form β amino acids and their derivatives, notably those based on the Arndt-Eistert synthesis.

Two main types of β-peptides exist: those with the organic residue (R) next to the amine are called β3-peptides, and those with the organic residue next to the carbonyl group are called β2-peptides.

Enzymes
Enzymes are generally globular proteins and range from just 62 amino acid residues in size, for the monomer of 4-oxalocrotonate tautomerase, to over 2,500 residues in the animal fatty acid synthase. A small number of RNA-based biological catalysts exist, with the most common being the ribosome; these are referred to as either RNA-enzymes or ribozymes. The activities of enzymes are determined by their three-dimensional structure. However, although structure does determine function, predicting a novel enzyme's activity just from its structure is a very difficult problem that has not yet been solved.

Most enzymes are much larger than the substrates they act on, and only a small portion of the enzyme (around 3–4 amino acids) is directly involved in catalysis. The region that contains these catalytic residues, binds the substrate, and then carries out the reaction is known as the active site. Enzymes can also contain sites that bind cofactors, which are needed for catalysis. Some enzymes also have binding sites for small molecules, which are often direct or indirect products or substrates of the reaction catalyzed. This binding can serve to increase or decrease the enzyme's activity, providing a means for feedback regulation.

Like all proteins, enzymes are long, linear chains of amino acids that fold to produce a three-dimensional product. Each unique amino acid sequence produces a specific structure, which has unique properties. Individual protein chains may sometimes group together to form a protein complex. Most enzymes can be denatured (that is, unfolded and inactivated) by heating or chemical denaturants, which disrupt the three-dimensional structure of the protein. Depending on the enzyme, denaturation may be reversible or irreversible. Structures of enzymes in complex with substrates or substrate analogs during a reaction may be obtained using time-resolved crystallography methods.

Classification of enzymes
An enzyme's name is often derived from its substrate or the chemical reaction it catalyzes, with the word ending in -ase. Examples are lactase, alcohol dehydrogenase and DNA polymerase. This may result in different enzymes with the same function, called isozymes (or isoenzymes), having the same basic name. Isozymes are homologous proteins that have different amino acid sequences and might be distinguished by their optimal pH, by their kinetic properties or immunologically. Furthermore, the normal physiological reaction an enzyme catalyzes may not be the same as under artificial conditions. This can result in the same enzyme being identified with two different names. For example, glucose isomerase, used industrially to convert glucose into the sweetener fructose, is a xylose isomerase in vivo.

The International Union of Biochemistry and Molecular Biology has developed a nomenclature for enzymes, the EC numbers. The Enzyme Commission number (EC number) is a numerical classification scheme for enzymes, based on the chemical reactions they catalyze. As a system of enzyme nomenclature, every EC number is associated with a recommended name for the respective enzyme. Each enzyme is described by a sequence of four numbers preceded by "EC". The first number broadly classifies the enzyme based on its mechanism. Strictly speaking, EC numbers do not specify enzymes, but enzyme-catalyzed reactions. If different enzymes (for instance from different organisms) catalyze the same reaction, then they receive the same EC number. By contrast, UniProt identifiers uniquely specify a protein by its amino acid sequence.

EC 1 Oxidoreductases: catalyze oxidation/reduction reactions

EC 2 Transferases: transfer a functional group (e.g. a methyl or phosphate group)

EC 3 Hydrolases: catalyze the hydrolysis of various bonds

EC 4 Lyases: cleave various bonds by means other than hydrolysis and oxidation

EC 5 Isomerases: catalyze isomerization changes within a single molecule

EC 6 Ligases: join two molecules with covalent bonds.
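The top-level classes listed above can be looked up from the first field of an EC number. The following is a minimal sketch that covers only the six classes named in this chapter; the function name and the error handling are assumptions made for illustration.

```python
# Top-level EC classes as listed in this chapter.
EC_CLASSES = {
    1: "Oxidoreductases",
    2: "Transferases",
    3: "Hydrolases",
    4: "Lyases",
    5: "Isomerases",
    6: "Ligases",
}


def ec_class(ec_number):
    """Return the top-level class name for an EC number such as 'EC 1.1.1.1'."""
    text = ec_number.strip()
    if text.upper().startswith("EC"):
        text = text[2:].strip()
    fields = text.split(".")
    if len(fields) != 4:
        raise ValueError("an EC number has four dot-separated fields")
    first = int(fields[0])
    if first not in EC_CLASSES:
        raise ValueError("unknown top-level EC class: %d" % first)
    return EC_CLASSES[first]


print(ec_class("EC 1.1.1.1"))  # Oxidoreductases (alcohol dehydrogenase)
print(ec_class("5.3.1.5"))     # Isomerases (xylose isomerase)
```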

Oxidoreductase
In molecular biology and biochemistry, an oxidoreductase is an enzyme that catalyzes the transfer of electrons from one molecule (the reductant, also called the hydrogen or electron donor) to another (the oxidant, also called the hydrogen or electron acceptor). This group of enzymes usually utilizes NADP or NAD as cofactors.

Primary structure of protein
In general, polypeptides are unbranched polymers, so their primary structure can often be specified by the sequence of amino acids along their backbone. However, proteins can become cross-linked, most commonly by disulfide bonds, and the primary structure also requires specifying the cross-linking atoms, e.g., specifying the cysteines involved in the protein's disulfide bonds. Other crosslinks include desmosine...

The chiral centers of a polypeptide chain can undergo racemization. In particular, the L-amino acids normally found in proteins can spontaneously isomerize at the Cα atom to form D-amino acids, which cannot be cleaved by most proteases.

The proposal that proteins were linear chains of α-amino acids was made nearly simultaneously by two scientists at the same conference in 1902, the 74th meeting of the Society of German Scientists and Physicians, held in Karlsbad. Franz Hofmeister made the proposal in the morning, based on his observations of the biuret reaction in proteins. Hofmeister was followed a few hours later by Emil Fischer, who had amassed a wealth of chemical details supporting the peptide-bond model. For completeness, the proposal that proteins contained amide linkages was made as early as 1882 by the French chemist E. Grimaux.

Despite these data and later evidence that proteolytically digested proteins yielded only oligopeptides, the idea that proteins were linear, unbranched polymers of amino acids was not accepted immediately. Some well-respected scientists such as William Astbury doubted that covalent bonds were strong enough to hold such long molecules together; they feared that thermal agitations would shake such long molecules asunder. Hermann Staudinger faced similar prejudices in the 1920s when he argued that rubber was composed of macromolecules. Thus, several alternative hypotheses arose. The colloidal protein hypothesis stated that proteins were colloidal assemblies of smaller molecules. This hypothesis was disproved in the 1920s by ultracentrifugation measurements by Theodor Svedberg that showed that proteins had a well-defined, reproducible molecular weight and by electrophoretic measurements by Arne Tiselius that indicated that proteins were single molecules.

A second hypothesis, the cyclol hypothesis advanced by Dorothy Wrinch, proposed that the linear polypeptide underwent a chemical cyclol rearrangement, C=O + HN → C(OH)-N, that crosslinked its backbone amide groups, forming a two-dimensional fabric. Other primary structures of proteins were proposed by various researchers, such as the diketopiperazine model of Emil Abderhalden and the pyrrol/piperidine model of Troensegaard in 1942. Although never given much credence, these alternative models were finally disproved when Frederick Sanger successfully sequenced insulin and by the crystallographic determination of myoglobin and hemoglobin by Max Perutz and John Kendrew.

The primary structure of peptides and proteins refers to the linear sequence of their amino acid structural units. The term "primary structure" was first coined by Linderstrøm-Lang in 1951. By convention, the primary structure of a protein is reported starting from the amino-terminal (N) end to the carboxyl-terminal (C) end. Post-translational modifications of a protein, such as disulfide formation, phosphorylation and glycosylation, are usually also considered a part of the primary structure, and cannot be read from the gene.

Secondary structure of protein
Secondary structure refers to highly regular local sub-structures. Two main types of secondary structure, the alpha helix and the beta strand, were suggested in 1951 by Linus Pauling and coworkers. These secondary structures are defined by patterns of hydrogen bonds between the main-chain peptide groups. They have a regular geometry, being constrained to specific values of the dihedral angles ψ and φ on the Ramachandran plot. Both the alpha helix and the beta sheet represent a way of saturating all the hydrogen bond donors and acceptors in the peptide backbone. Some parts of the protein are ordered but do not form any regular structures. They should not be confused with random coil, an unfolded polypeptide chain lacking any fixed three-dimensional structure. Several sequential secondary structures may form a "supersecondary unit".

Amino acids vary in their ability to form the various secondary structure elements. Proline and glycine are sometimes known as "helix breakers" because they disrupt the regularity of the α helical backbone conformation; however, both have unusual conformational abilities and are commonly found in turns. Amino acids that prefer to adopt helical conformations in proteins include methionine, alanine, leucine, glutamate and lysine ("MALEK" in amino-acid 1-letter codes); by contrast, the large aromatic residues (tryptophan, tyrosine and phenylalanine) and Cβ-branched amino acids (isoleucine, valine, and threonine) prefer to adopt β-strand conformations. However, these preferences are not strong enough to produce a reliable method of predicting secondary structure from sequence alone.

The most common secondary structures are alpha helices and beta sheets. Other helices, such as the 3₁₀ helix and π helix, are calculated to have energetically favorable hydrogen-bonding patterns but are rarely if ever observed in natural proteins except at the ends of α helices, due to unfavorable backbone packing in the center of the helix.

α helix

The amino acids in an α helix are arranged in a right-handed helical structure where each amino acid residue corresponds to a 100° turn in the helix (i.e., the helix has 3.6 residues per turn) and a translation of 1.5 Å (0.15 nm) along the helical axis. (Short pieces of left-handed helix sometimes occur with a large content of achiral glycine residues, but are unfavorable for the other normal, biological L-amino acids.) The pitch of the α helix (the vertical distance between consecutive turns of the helix) is 5.4 Å (0.54 nm), which is the product of 1.5 and 3.6. Most importantly, the N-H group of an amino acid forms a hydrogen bond with the C=O group of the amino acid four residues earlier; this repeated hydrogen bonding is the most prominent characteristic of an α helix. Official international nomenclature specifies two ways of defining α helices: rule 6.2 in terms of repeating φ,ψ torsion angles and rule 6.3 in terms of the combined pattern of pitch and hydrogen bonding.

Different amino-acid sequences have different propensities for forming α-helical structure. Methionine, alanine, leucine, uncharged glutamate, and lysine ("MALEK" in the amino-acid 1-letter codes) all have especially high helix-forming propensities, whereas proline and glycine have poor helix-forming propensities. Proline either breaks or kinks a helix, both because it cannot donate an amide hydrogen bond (having no amide hydrogen) and because its side chain interferes sterically with the backbone of the preceding turn; inside a helix, this forces a bend of about 30° in the helix axis.[9] However, proline is often seen as the first residue of a helix, presumably due to its structural rigidity. At the other extreme, glycine also tends to disrupt helices because its high conformational flexibility makes it entropically expensive to adopt the relatively constrained α-helical structure.

β sheet

The first β sheet structure was proposed by William Astbury in the 1930s. He proposed the idea of hydrogen bonding between the peptide bonds of parallel or antiparallel extended β strands. However, Astbury did not have the necessary data on the bond geometry of the amino acids to build accurate models, especially since he did not then know that the peptide bond was planar. A refined version was proposed by Linus Pauling and Robert Corey in 1951.

The β sheet (also β-pleated sheet) is the second form of regular secondary structure in proteins, only somewhat less common than the alpha helix. Beta sheets consist of beta strands connected laterally by at least two or three backbone hydrogen bonds, forming a generally twisted, pleated sheet. A beta strand (also β strand) is a stretch of polypeptide chain, typically 3 to 10 amino acids long, with its backbone in an almost fully extended conformation.

A very simple structural motif involving β sheets is the β hairpin, in which two antiparallel strands are linked by a short loop of two to five residues, one of which is frequently a glycine or a proline, both of which can assume the unusual dihedral-angle conformations required for a tight turn. However, individual strands can also be linked in more elaborate ways with long loops that may contain alpha helices or even entire protein domains.

Greek key motif

The Greek key motif consists of four adjacent antiparallel strands and their linking loops. Three of the antiparallel strands are connected by hairpins, while the fourth is adjacent to the first and linked to the third by a longer loop. This type of structure forms easily during the protein folding process. It was named after a pattern common to Greek ornamental artwork (see meander (art)).

The β-α-β motif

Due to the chirality of their component amino acids, all strands exhibit a "right-handed" twist evident in most higher-order β sheet structures. In particular, the linking loop between two parallel strands almost always has a right-handed crossover chirality, which is strongly favored by the inherent twist of the sheet. This linking loop frequently contains a helical region, in which case it is called a β-α-β motif. A closely related motif called a β-α-β-α motif forms the basic component of the most commonly observed protein tertiary structure, the TIM barrel.

β-meander motif

The β-meander is a simple supersecondary protein topology composed of two or more consecutive antiparallel β-strands linked together by hairpin loops. This motif is common in β-sheets and can be found in several structural architectures including β-barrels and β-propellers.

Psi-loop motif

The psi-loop (Ψ-loop) motif consists of two antiparallel strands with one strand in between that is connected to both by hydrogen bonds. There are four possible strand topologies for single Ψ-loops, as cited by Hutchinson et al. (1990). This motif is rare, as the process resulting in its formation seems unlikely to occur during protein folding. The Ψ-loop was first identified in the aspartic protease family.

Coiled coils

In 1952 Francis Crick proposed that α-keratin might form coiled coils and described mathematical methods for determining their structure. Remarkably, this was soon after the structure of the alpha helix was suggested in 1951 by Linus Pauling and coworkers.

Coiled coils usually contain a repeated pattern, hxxhcxc, of hydrophobic (h) and charged (c) amino-acid residues, referred to as a heptad repeat. The positions in the heptad repeat are usually labeled abcdefg, where a and d are the hydrophobic positions, often occupied by isoleucine, leucine or valine. Folding a sequence with this repeating pattern into an alpha-helical secondary structure causes the hydrophobic residues to be presented as a 'stripe' that coils gently around the helix in left-handed fashion, forming an amphipathic structure. The most favorable way for two such helices to arrange themselves in the water-filled environment of the cytoplasm is to wrap the hydrophobic stripes against each other, sandwiched between the hydrophilic amino acids. It is thus the burial of hydrophobic surfaces that provides the thermodynamic driving force for the oligomerization. The packing in a coiled-coil interface is exceptionally tight, with almost complete van der Waals contact between the side chains of the a and d residues. This tight packing was originally predicted by Francis Crick in 1952 and is referred to as knobs-into-holes packing. The α-helices may be parallel or anti-parallel, and usually adopt a left-handed super-coil. Although disfavored, a few right-handed coiled coils have also been observed in nature and in designed proteins.
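The heptad labeling described above is easy to express in code. The sketch below assigns the positions a to g along a sequence and reports what fraction of the a and d positions is occupied by isoleucine, leucine or valine. The example sequence is made up for illustration and is not a real protein; the function names and the choice of register start are assumptions.

```python
HEPTAD = "abcdefg"
CORE_HYDROPHOBIC = set("ILV")  # isoleucine, leucine, valine


def heptad_register(sequence, start="a"):
    """Label each residue with a heptad position, the first residue being `start`."""
    offset = HEPTAD.index(start)
    return "".join(HEPTAD[(offset + i) % 7] for i in range(len(sequence)))


def core_hydrophobic_fraction(sequence, start="a"):
    """Fraction of a and d positions occupied by I, L or V."""
    register = heptad_register(sequence, start)
    core = [res for res, pos in zip(sequence, register) if pos in "ad"]
    if not core:
        return 0.0
    return sum(res in CORE_HYDROPHOBIC for res in core) / len(core)


# Made-up sequence following the hxxhcxc pattern (h = L, c = E or K).
toy = "LKELEDK" * 3
print(heptad_register(toy)[:7])                    # abcdefg
print(core_hydrophobic_fraction(toy))              # 1.0
print(core_hydrophobic_fraction(toy, start="b"))   # 0.0
```

Scanning all seven possible values of `start` and keeping the one with the highest fraction is one simple way to guess the register of a candidate coiled-coil segment, although real prediction methods use more elaborate scoring.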



Tertiary structure of protein
Tertiary structure is considered to be largely determined by the protein's primary structure, the sequence of amino acids of which it is composed. Efforts to predict tertiary structure from the primary structure are known generally as protein structure prediction. However, the environment in which a protein is synthesized and allowed to fold is a significant determinant of its final shape and is usually not directly taken into account by current prediction methods.

In globular proteins, tertiary interactions are frequently stabilized by the sequestration of hydrophobic amino acid residues in the protein core, from which water is excluded, and by the consequent enrichment of charged or hydrophilic residues on the protein's water-exposed surface. In secreted proteins that do not spend time in the cytoplasm, disulfide bonds between cysteine residues help to maintain the protein's tertiary structure. A variety of common and stable tertiary structures appear in a large number of proteins that are unrelated in both function and evolution; for example, many proteins are shaped like a TIM barrel, named for the enzyme triosephosphate isomerase. Another common structure is a highly stable coiled coil composed of 2-7 alpha helices.

The majority of protein structures known to date have been solved with the experimental technique of X-ray crystallography, which typically provides data of high resolution but provides no time-dependent information on the protein's conformational flexibility. A second common way of solving protein structures uses NMR, which provides somewhat lower-resolution data in general and is limited to relatively small proteins, but can provide time-dependent information about the motion of a protein in solution. Dual polarisation interferometry is a time-resolved analytical method for determining the overall conformation and conformational changes in surface-captured proteins, providing complementary information to these high-resolution methods. More is known about the tertiary structural features of soluble globular proteins than about membrane proteins because the latter class is extremely difficult to study using these methods.

Quaternary structure of proteins
Several proteins are actually assemblies of more than one polypeptide chain, which in the context of the larger assemblage are known as protein subunits. In addition to the tertiary structure of the subunits, multiple-subunit proteins possess a quaternary structure, which is the arrangement into which the subunits assemble. Enzymes composed of subunits with diverse functions are sometimes called holoenzymes, in which some parts may be known as regulatory subunits and the functional core is known as the catalytic subunit. Examples of proteins with quaternary structure include hemoglobin, DNA polymerase, and ion channels. Other assemblies referred to instead as multiprotein complexes also possess quaternary structure. Examples include nucleosomes and microtubules.

Changes in quaternary structure can occur through conformational changes within individual subunits or through reorientation of the subunits relative to each other. It is through such changes, which underlie cooperativity and allostery in "multimeric" enzymes, that many proteins undergo regulation and perform their physiological function. The above definition follows a classical approach to biochemistry, established at times when the distinction between a protein and a functional, proteinaceous unit was difficult to elucidate. More recently, people refer to protein-protein interaction when discussing quaternary structure of proteins and consider all assemblies of proteins as protein complexes.

Protein structure determination
Around 90% of the protein structures available in the Protein Data Bank have been determined by X-ray crystallography. This method allows one to measure the 3D density distribution of electrons in the protein (in the crystallized state) and thereby infer the 3D coordinates of all the atoms to a certain resolution. Roughly 9% of the known protein structures have been obtained by nuclear magnetic resonance techniques. The secondary structure composition can be determined via circular dichroism or dual polarisation interferometry. Cryo-electron microscopy has recently become a means of determining protein structures to high resolution (less than 5 angstroms or 0.5 nanometer) and is anticipated to increase in power as a tool for high-resolution work in the next decade. This technique is a valuable resource for researchers working with very large protein complexes such as virus coat proteins and amyloid fibers.

X-ray crystallography
X-ray crystallography of biological molecules took off with Dorothy Crowfoot Hodgkin, who solved the structures of cholesterol (1937), penicillin (1945) and vitamin B12 (1954), for which she was awarded the Nobel Prize in Chemistry in 1964. In 1969, she succeeded in solving the structure of insulin, on which she had worked for over thirty years.

X-ray crystallography is a method of determining the arrangement of atoms within a crystal, in which a beam of X-rays strikes a crystal and diffracts into many specific directions. Crystal structures of proteins (which are irregular and hundreds of times larger than cholesterol) began to be solved in the late 1950s, beginning with the structure of sperm whale myoglobin by Max Perutz and Sir John Cowdery Kendrew, for which they were awarded the Nobel Prize in Chemistry in 1962. Since that success, over 61840 X-ray crystal structures of proteins, nucleic acids and other biological molecules have been determined. For comparison, the nearest competing method in terms of structures analyzed is nuclear magnetic resonance (NMR) spectroscopy, which has resolved 8759 chemical structures. Moreover, crystallography can solve structures of arbitrarily large molecules, whereas solution-state NMR is restricted to relatively small ones (less than 70 kDa). X-ray crystallography is now used routinely by scientists to determine how a pharmaceutical drug interacts with its protein target and what changes might improve it. However, intrinsic membrane proteins remain challenging to crystallize because they require detergents or other means to solubilize them in isolation, and such detergents often interfere with crystallization. Such membrane proteins are a large component of the genome and include many proteins of great physiological importance, such as ion channels and receptors.

Nuclear magnetic resonance spectroscopy or NMR
Protein nuclear magnetic resonance spectroscopy (usually abbreviated protein NMR) is a field of structural biology in which NMR spectroscopy is used to obtain information about the structure and dynamics of proteins. The field was pioneered by Richard R. Ernst and Kurt Wüthrich[1], among others. Protein NMR techniques are continually being used and improved in both academia and the biotech industry. Structure determination by NMR spectroscopy usually consists of several phases, each using a separate set of highly specialized techniques: the sample is prepared, resonances are assigned, restraints are generated, and a structure is calculated and validated.

How to sequence a protein?
Protein sequencing is a technique to determine the amino acid sequence of a protein, as well as which conformation the protein adopts and the extent to which it is complexed with any non-peptide molecules. Discovering the structures and functions of proteins in living organisms is an important tool for understanding cellular processes, and allows drugs that target specific metabolic pathways to be invented more easily. The two major direct methods of protein sequencing are mass spectrometry and the Edman degradation reaction. It is also possible to generate an amino acid sequence from the DNA or mRNA sequence encoding the protein, if this is known. However, there are a number of other reactions which can be used to gain more limited information about protein sequences and can be used as preliminaries to the aforementioned methods of sequencing or to overcome specific inadequacies within them.
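As an illustration of the indirect route mentioned above (deriving the amino acid sequence from a known DNA or mRNA sequence), the following minimal Python sketch translates a coding sequence with the standard genetic code. The function name translate and the example sequence are made up for illustration; the sketch assumes the input starts in the correct reading frame and contains only the four standard bases.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order at each
# position (first base varies slowest); '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(sequence: str) -> str:
    """Translate a coding sequence (frame 0) up to the first stop codon.

    Accepts DNA or mRNA; U is treated as T. Hypothetical helper for
    illustration only.
    """
    dna = sequence.upper().replace("U", "T")
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":  # stop codon ends the polypeptide
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    # Made-up coding sequence: Met-Gly-Ile-Val-Glu-Gln followed by a stop codon.
    print(translate("ATGGGCATTGTGGAACAGTAA"))
```

A sequence obtained this way describes only the primary translation product; it says nothing about disulfide bridges, removed signal peptides or other modifications, which is one reason the direct methods below remain useful.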

Edman degradation
The Edman degradation is a very important reaction for protein sequencing, because it allows the ordered amino acid composition of a protein to be discovered. Automated Edman sequencers are now in widespread use, and are able to sequence peptides up to approximately 50 amino acids long. A reaction scheme for sequencing a protein by the Edman degradation follows; some of the steps are elaborated on subsequently.

Break any disulfide bridges in the protein with an oxidising agent like performic acid or a reducing agent like 2-mercaptoethanol. A protecting group such as iodoacetic acid may be necessary to prevent the bonds from re-forming.

Separate and purify the individual chains of the protein complex, if there is more than one.

Determine the amino acid composition of each chain.

Determine the terminal amino acids of each chain.

Break each chain into fragments under 50 amino acids long.

Separate and purify the fragments.

Determine the sequence of each fragment.

Repeat with a different pattern of cleavage.

Construct the sequence of the overall protein.

Digestion into peptide fragments
Peptides longer than about 50-70 amino acids cannot be sequenced reliably by the Edman degradation. Because of this, long protein chains need to be broken up into small fragments which can then be sequenced individually. Digestion is done either by endopeptidases such as trypsin or pepsin or by chemical reagents such as cyanogen bromide. Different enzymes give different cleavage patterns, and the overlap between fragments can be used to construct an overall sequence.
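The idea of using two cleavage patterns to order fragments can be sketched in a few lines of Python. The example below is an illustration only: the 14-residue sequence is made up, the helper names cleave and assemble are hypothetical, and the cleavage rules are the usual simplifications (trypsin cuts after lysine or arginine unless the next residue is proline; cyanogen bromide cuts after methionine). The brute-force assembly is practical only for a handful of fragments.

```python
from itertools import permutations


def cleave(sequence: str, after: str, not_before: str = "") -> list[str]:
    """Cut a one-letter sequence on the C-terminal side of any residue in
    `after`, unless the following residue is in `not_before`."""
    fragments, start = [], 0
    for i in range(len(sequence) - 1):
        if sequence[i] in after and sequence[i + 1] not in not_before:
            fragments.append(sequence[start:i + 1])
            start = i + 1
    fragments.append(sequence[start:])
    return fragments


def assemble(fragments: list[str], overlaps: list[str]) -> list[str]:
    """Return every ordering of `fragments` whose concatenation contains
    all of the `overlaps` peptides (brute force, illustration only)."""
    candidates = {"".join(order) for order in permutations(fragments)}
    return sorted(c for c in candidates if all(o in c for o in overlaps))


if __name__ == "__main__":
    protein = "MKTAYRLMEEKGSW"  # made-up example sequence

    tryptic = cleave(protein, after="KR", not_before="P")
    cnbr = cleave(protein, after="M")
    print("trypsin:", tryptic)
    print("CNBr:   ", cnbr)

    # Pretend the fragments were sequenced separately and arrive unordered:
    # the CNBr peptides span the tryptic cut sites and fix the order.
    print("assembled:", assemble(sorted(tryptic), cnbr))
```

In this example the cyanogen bromide peptide KTAYRLM bridges three tryptic fragments, which is exactly the kind of overlap that lets the individual fragment sequences be placed in order.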

Phenylisothiocyanate is reacted with an uncharged terminal amino group, under mildly alkaline conditions, to form a cyclical phenylthiocarbamoyl derivative. Then, under acidic conditions, this derivative of the terminal amino acid is cleaved as a thiazolinone derivative. The thiazolinone amino acid is then selectively extracted into an organic solvent and treated with acid to form the more stable phenylthiohydantoin (PTH)-amino acid derivative, which can be identified by chromatography or electrophoresis. This procedure can then be repeated to identify the next amino acid. A major drawback of this technique is that the peptides being sequenced cannot have more than 50 to 60 residues (and in practice, under 30). The peptide length is limited because the cyclical derivatization does not always go to completion. This problem can be worked around by cleaving large peptides into smaller peptides before proceeding with the reaction. Modern machines, capable of over 99% efficiency per amino acid, can accurately sequence up to 30 amino acids. An advantage of the Edman degradation is that it uses only 10 - 100 picomoles of peptide for the sequencing process. The Edman degradation reaction is automated to speed up the process.
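The length limit follows from simple compounding: if each cycle succeeds with efficiency e, the fraction of peptide molecules still in step after n cycles is roughly e to the power n. The sketch below works this out under that simplified model, which ignores other losses such as sample washout; the function names are hypothetical.

```python
import math


def in_phase_fraction(efficiency: float, cycles: int) -> float:
    """Fraction of molecules that completed every one of `cycles` rounds,
    assuming a constant, independent per-cycle efficiency."""
    return efficiency ** cycles


def max_cycles(efficiency: float, min_fraction: float) -> int:
    """Largest number of cycles that keeps the in-phase fraction at or
    above `min_fraction` under the same simplified model."""
    return math.floor(math.log(min_fraction) / math.log(efficiency))


if __name__ == "__main__":
    for efficiency in (0.95, 0.99):
        for cycles in (10, 30, 50):
            fraction = in_phase_fraction(efficiency, cycles)
            print(f"efficiency {efficiency:.2f}, {cycles:2d} cycles: "
                  f"{fraction:.1%} of chains still in phase")
    print("cycles before the signal halves at 99%:", max_cycles(0.99, 0.5))
```

Even at 99% efficiency only about three quarters of the chains are still in phase after 30 cycles, and the out-of-phase remainder contributes background signal, which is consistent with the practical limits quoted above.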

N-terminal amino acid analysis
Determining which amino acid forms the N-terminus of a peptide chain is useful for two reasons: to aid the ordering of individual peptide fragments' sequences into a whole chain, and because the first round of Edman degradation is often contaminated by impurities and therefore does not give an accurate determination of the N-terminal amino acid. A generalised method for N-terminal amino acid analysis follows: React the peptide with a reagent which will selectively label the terminal amino acid. Hydrolyse the protein. Determine the amino acid by chromatography and comparison with standards. There are many different reagents which can be used to label terminal amino acids. They all react with amine groups and will therefore also bind to amine groups in the side chains of amino acids such as lysine - for this reason it is necessary to be careful in interpreting chromatograms to ensure that the right spot is chosen. Two of the more common reagents are Sanger's reagent (1-fluoro-2,4-dinitrobenzene) and dansyl derivatives such as dansyl chloride. Phenylisothiocyanate, the reagent for the Edman degradation, can also be used. The same questions apply here as in the determination of amino acid composition, with the exception that no stain is needed, as the reagents produce coloured derivatives and only qualitative analysis is required, so the amino acid does not have to be eluted from the chromatography column, just compared with a standard. Another consideration to take into account is that, since any amine groups will have reacted with the labelling reagent, ion exchange chromatography cannot be used, and thin layer chromatography or high pressure liquid chromatography should be used instead.

C-terminal amino acid analysis
The number of methods available for C-terminal amino acid analysis is much smaller than the number of available methods of N-terminal analysis. The most common method is to add carboxypeptidases to a solution of the protein, take samples at regular intervals, and determine the terminal amino acid by analysing a plot of amino acid concentrations against time.

Mass spectrometry
Mass spectrometry is now an important tool for the characterization of proteins. Protein mass spectrometry refers to the application of mass spectrometry to the study of proteins. The two primary methods for ionization of whole proteins are electrospray ionization (ESI) and matrix-assisted laser desorption/ionization (MALDI). In keeping with the performance and mass range of available mass spectrometers, two approaches are used for characterizing proteins. In the first, intact proteins are ionized by either of the two techniques described above, and then introduced to a mass analyzer. This approach is referred to as the "top-down" strategy of protein analysis. In the second, proteins are enzymatically digested into smaller peptides using a protease such as trypsin. Subsequently these peptides are introduced into the mass spectrometer and identified by peptide mass fingerprinting or tandem mass spectrometry. Hence, this latter approach (also called "bottom-up" proteomics) uses identification at the peptide level to infer the existence of proteins.

Whole protein mass analysis is primarily conducted using either time-of-flight (TOF) MS, or Fourier transform ion cyclotron resonance (FT-ICR). These two types of instrument are preferable here because of their wide mass range, and in the case of FT-ICR, its high mass accuracy. Mass analysis of proteolytic peptides is a much more popular method of protein characterization, as cheaper instrument designs can be used for characterization. Additionally, sample preparation is easier once whole proteins have been digested into smaller peptide fragments. The most widely used instruments for peptide mass analysis are the MALDI time-of-flight instruments, as they permit the acquisition of peptide mass fingerprints (PMFs) at a high pace (1 PMF can be analyzed in approx. 10 sec). Multiple stage quadrupole-time-of-flight and the quadrupole ion trap also find use in this application.
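To make peptide mass fingerprinting concrete, the sketch below computes a theoretical fingerprint for a protein: it digests the sequence in silico with a simplified trypsin rule (cut after K or R, but not before P), sums monoisotopic residue masses, adds one water per peptide and one proton for the singly charged ion, and counts how many observed peaks fall within a tolerance. The function names, the example sequence and the 0.02 Da tolerance are assumptions chosen for illustration; real search engines also handle missed cleavages and modifications.

```python
# Monoisotopic residue masses in daltons (amino acid minus one water).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # added once per peptide for the free termini
PROTON = 1.00728   # added for a singly protonated ion, [M+H]+


def tryptic_peptides(protein: str) -> list[str]:
    """Simplified trypsin rule: cut after K or R unless followed by P."""
    peptides, start = [], 0
    for i in range(len(protein) - 1):
        if protein[i] in "KR" and protein[i + 1] != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    peptides.append(protein[start:])
    return peptides


def peptide_mass(peptide: str) -> float:
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[residue] for residue in peptide) + WATER


def fingerprint(protein: str) -> list[float]:
    """Sorted [M+H]+ values for all tryptic peptides of `protein`."""
    return sorted(peptide_mass(p) + PROTON for p in tryptic_peptides(protein))


def count_matches(observed: list[float], theoretical: list[float],
                  tolerance: float = 0.02) -> int:
    """Number of observed peaks lying within `tolerance` Da of any
    theoretical peak."""
    return sum(
        any(abs(peak - mass) <= tolerance for mass in theoretical)
        for peak in observed
    )


if __name__ == "__main__":
    candidate = "MKTAYRLMEEKGSW"  # made-up database entry
    theoretical = fingerprint(candidate)
    for mass in theoretical:
        print(f"{mass:10.4f}")
    # A measured peak list would be compared against each database entry.
    print(count_matches(theoretical, theoretical), "of",
          len(theoretical), "peaks matched")
```

In a real search the same comparison is repeated for every protein in a sequence database, and the entry whose theoretical fingerprint explains the most observed peaks is reported as the likely identity.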

Conjugated protein
A conjugated protein is a protein that functions in interaction with other chemical groups attached by covalent bonds or by weak interactions. Many proteins contain only amino acids and no other chemical groups, and they are called simple proteins. However, other kinds of proteins yield, on hydrolysis, some other chemical component in addition to amino acids, and they are called conjugated proteins. The non-amino-acid part of a conjugated protein is usually called its prosthetic group. Many prosthetic groups are formed from vitamins. Conjugated proteins are classified on the basis of the chemical nature of their prosthetic groups. Some examples of conjugated proteins follow.

Lipoproteins
A lipoprotein is a biochemical assembly that contains both proteins and lipids, with the lipids bound to the proteins. Many enzymes, transporters, structural proteins, antigens, adhesins and toxins are lipoproteins. Examples include the high density (HDL) and low density (LDL) lipoproteins, which enable fats to be carried in the blood stream, the transmembrane proteins of the mitochondrion and the chloroplast, and bacterial lipoproteins.

Glycoproteins
Glycoproteins are proteins that contain oligosaccharide chains (glycans) covalently attached to polypeptide side-chains. The carbohydrate is attached to the protein in a cotranslational or posttranslational modification. This process is known as glycosylation. In proteins that have segments extending extracellularly, the extracellular segments are often glycosylated. Glycoproteins are often important integral membrane proteins, where they play a role in cell-cell interactions. Glycoproteins also occur in the cytosol, but their functions and the pathways producing these modifications in this compartment are less well understood. Glycoproteins are generally the largest and most abundant group of conjugated proteins. They range from glycoproteins in cell surface membranes that constitute the glycocalyx, to important antibodies produced by leukocytes.

Phosphoproteins
Phosphoproteins are proteins which are chemically bonded to a substance containing phosphoric acid (see phosphorylation for more). This category includes Fc receptors, Ulks, calcineurins, K chips, and urocortins.

Metalloprotein
A protein that contains a metal ion as a cofactor is known as a metalloprotein. Metalloproteins have many different functions in cells, such as enzymes, transport and storage proteins, and signal transduction proteins. Indeed, about one quarter to one third of all proteins require metals to carry out their functions. The metal ion is usually coordinated by nitrogen, oxygen or sulfur atoms belonging to amino acids in the polypeptide chain and/or a macrocyclic ligand incorporated into the protein. The presence of the metal ion allows metalloenzymes to perform functions such as redox reactions that cannot easily be performed by the limited set of functional groups found in amino acids.



Hemoproteins
A hemeprotein (or hemoprotein or haemoprotein), or heme protein, is a metalloprotein containing a heme prosthetic group, either covalently or noncovalently bound to the protein itself. The iron in the heme is capable of undergoing oxidation and reduction (usually to +2 and +3, though stabilized Fe+4 and even Fe+5 species are well known in the peroxidases). Hemoproteins probably evolved from a primordial strategy for incorporating the iron (Fe) atom contained within the protoporphyrin IX ring of heme into proteins. This strategy has been maintained throughout evolution as it makes hemoproteins responsive to molecules that can bind divalent iron (Fe). These molecules include, but are probably not restricted to, gaseous molecules such as oxygen (O2), nitric oxide (NO), carbon monoxide (CO) and hydrogen sulfide (H2S). Once bound to the prosthetic heme groups of hemoproteins, these gaseous molecules can modulate the activity or function of those hemoproteins in a way that is said to afford signal transduction. Therefore, when produced in biologic systems (cells), these gaseous molecules are referred to as gasotransmitters. Haemoglobin contains an iron-containing prosthetic group, the haem. It is the haem group that carries the oxygen molecule, through the binding of the oxygen molecule to the iron ion (Fe2+) found in the haem group.

Hemoglobin
Hemoglobin (also spelled haemoglobin and abbreviated Hb or Hgb) is the iron-containing oxygen-transport metalloprotein in the red blood cells of all vertebrates[1] (except the fish family Channichthyidae) and the tissues of some invertebrates. Hemoglobin in the blood is what transports oxygen from the lungs or gills to the rest of the body (i.e. the tissues), where it releases the oxygen for cell use and collects carbon dioxide to bring it back to the lungs. In mammals the protein makes up about 97% of the red blood cells' dry content, and around 35% of the total content (including water). Hemoglobin has an oxygen binding capacity of 1.34 ml O2 per gram of hemoglobin, which increases the total blood oxygen capacity seventyfold. Hemoglobin is involved in the transport of other gases: it carries some of the body's respiratory carbon dioxide (about 10% of the total) as carbaminohemoglobin, in which CO2 is bound to the globin protein. The molecule also carries the important regulatory molecule nitric oxide bound to a globin protein thiol group, releasing it at the same time as oxygen. Hemoglobin is also found outside red blood cells and their progenitor lines. Other cells that contain hemoglobin include the A9 dopaminergic neurons in the substantia nigra, macrophages, alveolar cells, and mesangial cells in the kidney. In these tissues, hemoglobin has a non-oxygen-carrying function as an antioxidant and a regulator of iron metabolism. Hemoglobin and hemoglobin-like molecules are also found in many invertebrates, fungi, and plants. In these organisms, hemoglobins may carry oxygen, or they may act to transport and regulate other things such as carbon dioxide, nitric oxide, hydrogen sulfide and sulfide. A variant of the molecule, called leghemoglobin, is used to scavenge oxygen, to keep it from poisoning anaerobic systems, such as nitrogen-fixing nodules of leguminous plants.
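The "seventyfold" figure can be checked with a short back-of-the-envelope calculation. The sketch below takes the binding capacity of 1.34 ml O2 per gram from the text; the other numbers are textbook-style assumptions that are not stated in this article: a hemoglobin concentration of 15 g per 100 ml of blood, an oxygen solubility in plasma of about 0.003 ml O2 per 100 ml per mmHg, an arterial oxygen pressure of 100 mmHg, and full saturation of hemoglobin.

```python
# Value taken from the text above.
O2_CAPACITY_ML_PER_G_HB = 1.34

# Assumed typical values (not from this article).
HB_G_PER_DL = 15.0                  # hemoglobin per 100 ml of blood
O2_SOLUBILITY_ML_PER_DL_MMHG = 0.003  # dissolved O2 per 100 ml per mmHg
ARTERIAL_PO2_MMHG = 100.0           # arterial oxygen partial pressure


def bound_o2_ml_per_dl(hb_g_per_dl: float) -> float:
    """Oxygen carried by fully saturated hemoglobin, per 100 ml of blood."""
    return O2_CAPACITY_ML_PER_G_HB * hb_g_per_dl


def dissolved_o2_ml_per_dl(po2_mmhg: float) -> float:
    """Oxygen physically dissolved in blood, per 100 ml."""
    return O2_SOLUBILITY_ML_PER_DL_MMHG * po2_mmhg


if __name__ == "__main__":
    bound = bound_o2_ml_per_dl(HB_G_PER_DL)
    dissolved = dissolved_o2_ml_per_dl(ARTERIAL_PO2_MMHG)
    print(f"bound to hemoglobin: {bound:.1f} ml O2 per 100 ml blood")
    print(f"dissolved:           {dissolved:.1f} ml O2 per 100 ml blood")
    print(f"ratio:               {bound / dissolved:.0f}-fold")
```

Under these assumptions hemoglobin carries roughly 20 ml of oxygen per 100 ml of blood against about 0.3 ml in solution, a ratio close to seventy.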

Cytochromes
Cytochromes are, in general, membrane-bound hemoproteins that contain heme groups and carry out electron transport. They are found either as monomeric proteins (e.g., cytochrome c) or as subunits of bigger enzymatic complexes that catalyze redox reactions. They are found in the mitochondrial inner membrane and endoplasmic reticulum of eukaryotes, in the chloroplasts of plants, in photosynthetic microorganisms, and in bacteria.



Opsins
Opsins are a group of light-sensitive 35-55 kDa membrane-bound G protein-coupled receptors of the retinylidene protein family found in photoreceptor cells of the retina. Five classical groups of opsins are involved in vision, mediating the conversion of a photon of light into an electrochemical signal, the first step in the visual transduction cascade. Another opsin found in the mammalian retina, melanopsin, is involved in circadian rhythms and pupillary reflex but not in image-forming.

Flavoproteins
Flavoproteins are proteins that contain a nucleic acid derivative of riboflavin: the flavin adenine dinucleotide (FAD) or flavin mononucleotide (FMN). Flavoproteins are involved in a wide array of biological processes, including, but by no means limited to, bioluminescence, removal of radicals contributing to oxidative stress, photosynthesis, DNA repair, and apoptosis. The spectroscopic properties of the flavin cofactor make it a natural reporter for changes occurring within the active site; this makes flavoproteins one of the most-studied enzyme families.

Simple proteins
The proteins which upon hydrolysis yield only amino acids are known as simple proteins.

Albumin
Albumin (Latin: albus, white) refers generally to any protein that is water soluble, is moderately soluble in concentrated salt solutions, and undergoes heat denaturation. Albumins are commonly found in blood plasma, and differ from other plasma proteins in that they are not glycosylated. Substances containing albumin, such as egg white, are called albuminoids.

Globulin
Globulin is one of the three types of serum proteins, the others being albumin and fibrinogen. Some globulins are produced in the liver, while others are made by the immune system. The term globulin encompasses a heterogeneous group of proteins with typically high molecular weight, and with both solubility and electrophoretic migration rates lower than those of albumin.

Histones
In biology, histones are highly alkaline proteins found in eukaryotic cell nuclei, which package and order the DNA into structural units called nucleosomes. They are the chief protein components of chromatin, acting as spools around which DNA winds, and play a role in gene regulation.

Peptones
Peptones are derived from animal milk or meat by proteolytic digestion. In addition to containing small peptides, the resulting spray-dried material includes fats, metals, salts, vitamins and many other biological compounds. Peptone is used in nutrient media for growing bacteria and fungi.

Proteases
Proteases occur naturally in all organisms. These enzymes are involved in a multitude of physiological reactions from simple digestion of food proteins to highly-regulated cascades (e.g., the blood-clotting cascade, the complement system, apoptosis pathways, and the invertebrate prophenoloxidase-activating cascade). Proteases can either break specific peptide bonds (limited proteolysis), depending on the amino acid sequence of a protein, or break down a complete peptide to amino acids (unlimited proteolysis). The activity can be a destructive change, abolishing a protein's function or digesting it to its principal components; it can be an activation of a function, or it can be a signal in a signaling pathway.

Protein Data Bank or PDB
Like fuel and flame, two forces converged to initiate the Protein Data Bank (PDB): 1) a small but growing data base of sets of protein structures determined by X-ray diffraction and 2) the newly available (1968) molecular graphics display, the BRookhaven Raster Display (BRAD), to inspect these structures in 3-D. In 1969, with the sponsorship of Dr. Walter Hamilton at the Brookhaven National Laboratory, Dr. Edgar Meyer (Texas A&M University) began to write software to store atomic coordinate files in a common format to make them available for geometric and graphical evaluation. By 1971 program SEARCH was executed remotely to extract and examine structural data and thereby was instrumental in initiating networking, thus marking the functional beginning of the PDB.

Upon Hamilton's death in 1973, Dr. Tom Koetzle took over direction of the PDB for the subsequent 20 years. In January 1994, Dr. Joel Sussman of Israel's Weizmann Institute of Science was appointed head of the PDB. In October 1998, the PDB was transferred to the Research Collaboratory for Structural Bioinformatics (RCSB); the transfer was completed in June 1999. The new director was Dr. Helen M. Berman of Rutgers University (one of the member institutions of the RCSB). In 2003, with the formation of the wwPDB, the PDB became an international organization. The founding members are PDBe (Europe), RCSB (USA), and PDBj (Japan). The BMRB joined in 2006. Each of the four members of wwPDB can act as a deposition, data processing and distribution center for PDB data. Data processing here means that wwPDB staff review and annotate each submitted entry. The data are then automatically checked for plausibility (the source code for this validation software has been made available to the public at no charge).

The Protein Data Bank (PDB) is a repository for the 3-D structural data of large biological molecules, such as proteins and nucleic acids. (See also crystallographic database). The data, typically obtained by X-ray crystallography or NMR spectroscopy and submitted by biologists and biochemists from around the world, are freely accessible on the Internet via the websites of its member organisations (PDBe, PDBj, and RCSB). The PDB is overseen by an organization called the Worldwide Protein Data Bank, wwPDB.

Insulin


Within vertebrates, the amino acid sequence of insulin is extremely well preserved. Bovine insulin differs from human in only three amino acid residues, and porcine insulin in one. Even insulin from some species of fish is similar enough to human to be clinically effective in humans. Insulin in some invertebrates is quite similar in sequence to human insulin, and has similar physiological effects. The strong homology seen in the insulin sequence of diverse species suggests that it has been conserved across much of animal evolutionary history. The C-peptide of proinsulin, however, differs much more amongst species; it is also a hormone, but a secondary one.

Insulin is produced and stored in the body as a hexamer (a unit of six insulin molecules), while the active form is the monomer. The hexamer is an inactive form with long-term stability, which serves as a way to keep the highly reactive insulin protected, yet readily available. The hexamer-monomer conversion is one of the central aspects of insulin formulations for injection. The hexamer is far more stable than the monomer, which is desirable for practical reasons. However, the monomer is a much faster-reacting drug because diffusion rate is inversely related to particle size. A fast-reacting drug means that insulin injections do not have to precede mealtimes by hours, which in turn gives diabetics more flexibility in their daily schedule. Insulin can aggregate and form fibrillar interdigitated beta-sheets. This can cause injection amyloidosis, and prevents the storage of insulin for long periods.

In 1869 Paul Langerhans, a medical student in Berlin, was studying the structure of the pancreas under a microscope when he identified some previously unnoticed tissue clumps scattered throughout the bulk of the pancreas. The function of the "little heaps of cells," later known as the Islets of Langerhans, was unknown, but Edouard Laguesse later suggested that they might produce secretions that play a regulatory role in digestion. Paul Langerhans' son, Archibald, also helped to understand this regulatory role. The term insulin originates from insula, the Latin word for islet/island. In 1889, the Polish-German physician Oscar Minkowski, in collaboration with Joseph von Mering, removed the pancreas from a healthy dog to test its assumed role in digestion. Several days after the dog's pancreas was removed, Minkowski's animal keeper noticed a swarm of flies feeding on the dog's urine. On testing the urine they found that it contained sugar, establishing for the first time a relationship between the pancreas and diabetes. In 1901, another major step was taken by Eugene Opie, when he clearly established the link between the Islets of Langerhans and diabetes: "Diabetes mellitus … is caused by destruction of the islets of Langerhans and occurs only when these bodies are in part or wholly destroyed." Before his work, the link between the pancreas and diabetes was clear, but not the specific role of the islets.

The Nobel Prize committee in 1923 credited the practical extraction of insulin to a team at the University of Toronto and awarded the Nobel Prize in Physiology or Medicine to two men, Frederick Banting and J.J.R. Macleod, for the discovery of insulin. Banting, insulted that Charles Best was not mentioned, shared his prize with Best, and Macleod immediately shared his with James Collip. The patent for insulin was sold to the University of Toronto for one dollar.

The primary structure of insulin was determined by British molecular biologist Frederick Sanger. It was the first protein to have its sequence determined. He was awarded the 1958 Nobel Prize in Chemistry for this work. In 1969, after decades of work, Dorothy Crowfoot Hodgkin determined the spatial conformation of the molecule, the so-called tertiary structure, by means of X-ray diffraction studies. She had been awarded a Nobel Prize in Chemistry in 1964 for the development of crystallography. Rosalyn Sussman Yalow received the 1977 Nobel Prize in Physiology or Medicine for the development of the radioimmunoassay for insulin.