Structural Biochemistry/SitePrediction

Introduction to Proteinase Cleavage Sites
Proteinases are enzymes that hydrolyze the peptide bonds between amino acids in proteins. They make up about 2% of all gene products and are significant in biotechnology and medicine because their effector function can be readily targeted by small peptide-based inhibitors. Many human diseases are caused by malfunctioning proteolytic activity, which can have devastating consequences.

Proteinases can be classified into six catalytic types: serine (S), cysteine (C), threonine (T), aspartic acid (D), glutamic acid (E), and metallo. They can also be classified by the kind of reaction they catalyze. Endoproteinases hydrolyze internal alpha-peptide bonds in a polypeptide, whereas exoproteinases require a free N-terminal group, C-terminal group, or both, and hydrolyze a peptide bond that is no more than 3 residues from the terminus. The activity of both endoproteinases and exoproteinases depends on pH. Exoproteinases are usually involved in degradative processes such as food digestion, proteasome phagocytosis, and proteasomal digestion. Endoproteinases are usually highly specific for their target sequences.

For a protein to be cleaved, it must first bind to the active site of the proteinase. For proteins involved in biological processes, cleavage activates, inactivates, or otherwise modifies the substrate within some pathway. The active site of a proteinase has adjoining S pockets that accommodate several consecutive amino acids of the substrate. The P1 site is the amino acid on whose C-terminal side cleavage occurs. Additional sites lie N-terminal to P1 (labeled P2, P3, etc.) and are accommodated by the corresponding S sites (S2, S3, etc.) in the catalytic pocket of the proteinase. For some proteinases, substrate specificity is defined by the number of subsites in the active site and by the size, shape, and charge of the side chains. The 3-D structure of the substrate determines how well it fits the S sites in the substrate-binding pocket.
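The P-site labeling described above can be illustrated with a short sketch. The function name `label_subsites`, its parameters, and the toy peptide are hypothetical and chosen only for illustration; the sketch simply walks N-terminally from the residue after which the bond is cut.

```python
def label_subsites(sequence, cleavage_index, n_sites=4):
    """Label residues N-terminal of the cleaved bond as P1, P2, ...

    cleavage_index is the 0-based position of the P1 residue, i.e. the
    bond between sequence[cleavage_index] and sequence[cleavage_index + 1]
    is the one that is cut. (Hypothetical helper, for illustration only.)
    """
    labels = {}
    for k in range(1, n_sites + 1):
        pos = cleavage_index - (k - 1)
        if pos < 0:
            break  # ran off the N-terminus of the substrate
        labels[f"P{k}"] = sequence[pos]
    return labels


# Toy peptide (made up) that contains the motif DEVD; the cut is assumed
# to occur after the second D, at index 5.
peptide = "GADEVDGS"
sites = label_subsites(peptide, cleavage_index=5)
print(sites)  # {'P1': 'D', 'P2': 'V', 'P3': 'E', 'P4': 'D'}
```

Each label Pn corresponds to the pocket Sn of the proteinase that would accommodate that residue.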

Cleavage Site Prediction Tools
Several tools can be used to predict possible substrates of proteinases and the most likely locations of cleavage. PoPS is used for modeling and predicting proteinase specificity. PeptideCutter predicts potential cleavage sites for proteases. GraBCas predicts the sites cleaved by granzyme B and caspases, and CaSPredictor predicts caspase substrates. GraBCas and CaSPredictor are specific to certain proteases, which limits their applicability. Other tools are complex and have cumbersome user interfaces. SitePrediction was therefore developed as a user-friendly tool that predicts possible cleavage sites in specific substrates based on cleavage sites reported in the literature or found in experiments. SitePrediction also integrates additional features, such as secondary structure prediction, solvent accessibility prediction, and PEST sequence prediction.

The main goal of all of these tools is to predict the location of cleavage sites in candidate substrates. Their applicability and reliability depend on three main steps. First, the user defines which protease specificity should be used and which protein sequence should be analyzed. Second, the reliability of the prediction depends on the calculation method used. Third, the value of the prediction can be increased with the secondary features implemented in SitePrediction.

User Input
For the first of these three steps, two decisions have to be made about the input: which protease is of interest, and which substrate should be searched for possible cleavage sites? Some tools, such as GraBCas, PoPS, and PeptideCutter, accept substrate sequences one at a time in FASTA format, while others, such as CaSPredictor and SitePrediction, accept a list of FASTA sequences. SitePrediction also accepts a list of common identifiers, such as SwissProt identifiers.
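To make the input step concrete, the following is a minimal sketch of reading a list of FASTA records into a dictionary. It does not reproduce how any of the tools above parse their input; the function name `parse_fasta` and the record identifiers `subA` and `subB` are hypothetical.

```python
def parse_fasta(text):
    """Return {identifier: sequence} for a multi-record FASTA string."""
    records = {}
    current_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the id;
            # fall back to a generated name if the header is empty.
            header = line[1:].split()
            current_id = header[0] if header else f"record_{len(records) + 1}"
            records[current_id] = ""
        elif current_id is not None:
            # Sequence lines may be wrapped, so concatenate them.
            records[current_id] += line.upper()
    return records


# Two made-up substrate records.
fasta_text = """>subA hypothetical substrate A
MKDEVDGA
DKKDLS
>subB hypothetical substrate B
mstpeetsp
"""
substrates = parse_fasta(fasta_text)
for name, seq in substrates.items():
    print(name, len(seq), seq)
```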

However, one of the most important steps in cleavage site prediction is defining the specificity of the chosen protease. GraBCas and CaSPredictor support only certain proteases, namely granzyme B and some caspases. PeptideCutter offers more options but only uses fixed consensus sites. PoPS lets the user choose from a list of predefined proteases. SitePrediction allows the entry of 'known cleavage sites' from a list based on the literature and on experiments recorded in databases. These cleavage site lists are used to calculate a statistical profile. SitePrediction presents the cleavage site analysis in two ways: as a logo and as a histogram.
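As an illustration of how a list of known cleavage sites can be turned into a statistical profile, the sketch below counts how often each amino acid occurs at each position. This is an assumed, simplified form of such a profile, not the documented SitePrediction calculation; the function name and the four example sites are made up.

```python
from collections import Counter


def frequency_profile(sites):
    """Per-position amino acid frequencies for equal-length cleavage sites."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    profile = []
    for pos in range(length):
        counts = Counter(site[pos] for site in sites)
        profile.append({aa: n / len(sites) for aa, n in counts.items()})
    return profile


# Made-up P4..P1 sites, for illustration only.
known_sites = ["DEVD", "DEVD", "DMQD", "DEAD"]
profile = frequency_profile(known_sites)

# Crude text histogram, one block per position (index 0 is P4, the last is P1).
for i, column in enumerate(profile):
    print(f"P{len(profile) - i}")
    for aa, freq in sorted(column.items(), key=lambda kv: -kv[1]):
        print(f"  {aa} {'#' * round(freq * 20)} {freq:.2f}")
```

A sequence logo conveys the same per-position information graphically.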

Scoring Methods
The tools differ mainly in the scoring method used to predict cleavage sites in a substrate. The simplest tool, PeptideCutter, looks for occurrences of fixed consensus cleavage sites in the substrate. All of the other tools use a frequency score that indicates how likely each amino acid of a candidate site is to occur at its position. These scores are not free of errors, however. CaSPredictor, for example, adds the frequencies of the individual positions instead of multiplying them. SitePrediction also uses a second score based on similarity to the known cleavage sites, usually computed with BLOSUM62 or another substitution matrix.
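The difference between multiplying and adding per-position frequencies can be seen in a small sketch. It assumes that positions are treated as independent, so that the joint likelihood of a window is the product of its per-position frequencies. The profile values, function names, and test sequence are hypothetical, and the code is not taken from any of the tools discussed.

```python
import math

# Hypothetical P4..P1 frequency profile (one dict per position).
PROFILE = [
    {"D": 1.0},
    {"E": 0.75, "M": 0.25},
    {"V": 0.5, "Q": 0.25, "A": 0.25},
    {"D": 1.0},
]


def product_score(window, profile):
    """Multiply per-position frequencies; unseen residues contribute 0."""
    return math.prod(col.get(aa, 0.0) for aa, col in zip(window, profile))


def sum_score(window, profile):
    """Add per-position frequencies instead of multiplying them."""
    return sum(col.get(aa, 0.0) for aa, col in zip(window, profile))


def scan(sequence, profile, score):
    """Score every window; return (P1 index, window, score) tuples."""
    width = len(profile)
    return [
        (i + width - 1, sequence[i:i + width], score(sequence[i:i + width], profile))
        for i in range(len(sequence) - width + 1)
    ]


sequence = "MKDEVDGADKKDLS"  # made-up substrate
best = max(scan(sequence, PROFILE, product_score), key=lambda hit: hit[2])
print(best)                            # (5, 'DEVD', 0.375)
print(product_score("DKKD", PROFILE))  # 0.0
print(sum_score("DKKD", PROFILE))      # 2.0
```

In this toy case the window DKKD contains two residues never observed at their positions, yet the additive score still rewards it, whereas the product drops to zero. In practice a small pseudocount is often used so that a single unseen residue does not eliminate a site outright.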

Extra Features
Extra features such as statistics, PEST sequences, solvent accessibility, and secondary structure are also important parts of these tools. SitePrediction and PoPS offer some of them in addition to the basic prediction of cleavage sites in substrates. These features support the prediction of cleavage sites and provide additional information. One important feature of SitePrediction is its statistics calculation. It gives the user insight into the quality of the input sites and into threshold scores for the predicted cleavage sites by comparing the scores of experimentally known sites with those of the sites being tested or of random sequences. In addition, both SitePrediction and PoPS predict PEST regions. These regions are rich in proline (P), glutamic acid (E), serine (S), and threonine (T); they may be more susceptible to proteolysis and may affect cleavage site prediction results by forming unstructured loops. SitePrediction additionally integrates the SSPro package to predict solvent accessibility and secondary structure.

Statistics
The statistical features of SitePrediction were used to estimate the quality of its predictions. The tool was run against input cleavage sites of both caspase-3 and calpain-2. The results were evaluated with a receiver operating characteristic (ROC) curve, defined as the plot of sensitivity (the true positive rate) versus the false positive rate. Predictive accuracy is summarized by the area under the curve (AUC). The AUC values for caspase-3 and calpain-2 were 0.995 and 0.951, respectively, which indicates that the predictions were very accurate for the given input sites.
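The ROC quantities can be sketched in a few lines. The example below computes sensitivity and false positive rate at one threshold, and the AUC as the fraction of (known site, random sequence) pairs in which the known site scores higher, which is equivalent to the area under the ROC curve. The scores are invented for illustration and are unrelated to the values reported above; the function names are hypothetical.

```python
def rates_at_threshold(positive_scores, negative_scores, threshold):
    """Return (sensitivity, false positive rate) when calling score >= threshold a hit."""
    sensitivity = sum(s >= threshold for s in positive_scores) / len(positive_scores)
    false_positive_rate = sum(s >= threshold for s in negative_scores) / len(negative_scores)
    return sensitivity, false_positive_rate


def roc_auc(positive_scores, negative_scores):
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Invented scores: experimentally known sites vs. random sequences.
known = [0.9, 0.8, 0.7, 0.4]
random_background = [0.5, 0.3, 0.2, 0.1]

print(rates_at_threshold(known, random_background, 0.6))  # (0.75, 0.0)
print(roc_auc(known, random_background))                  # 0.9375
```

An AUC of 1.0 would mean every known site outscores every random sequence, while 0.5 corresponds to random guessing.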

PEST Analysis
The PEST analysis feature of SitePrediction was applied to the predicted substrates of caspase-3, calpain-2, granzyme-B, and cathepsin-D. The percentage of amino acids in PEST regions was calculated for all of the substrates and compared with that of the translated transcripts of the human genome currently available from the NCBI. In the human proteome, 3.46% of all amino acids are calculated to lie in PEST sequences. The percentage of proteins that contain PEST sequences reveals a clear difference: only 33% of human proteins are predicted to contain PEST sequences, compared with 58%, 50%, 58%, and 39% of the caspase-3, calpain-2, granzyme-B, and cathepsin-D substrates, respectively.
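The kind of percentage reported here can be sketched as follows. The sketch uses a deliberately crude stand-in for PEST detection: a sliding window is flagged when a chosen fraction of its residues are P, E, S, or T. This is an assumption made for illustration, not the PEST-finding method used by SitePrediction, and the window size, threshold, function names, and sequences are all hypothetical.

```python
def pest_mask(sequence, window=12, min_fraction=0.5):
    """Mark residues that fall inside any window enriched in P, E, S, or T."""
    pest = set("PEST")
    mask = [False] * len(sequence)
    for start in range(len(sequence) - window + 1):
        segment = sequence[start:start + window]
        if sum(aa in pest for aa in segment) / window >= min_fraction:
            for i in range(start, start + window):
                mask[i] = True
    return mask


def percent_in_pest(sequences, window=12, min_fraction=0.5):
    """Percentage of all residues that lie in flagged windows."""
    total = sum(len(seq) for seq in sequences)
    covered = sum(sum(pest_mask(seq, window, min_fraction)) for seq in sequences)
    return 100.0 * covered / total if total else 0.0


def percent_with_pest(sequences, window=12, min_fraction=0.5):
    """Percentage of sequences that contain at least one flagged window."""
    if not sequences:
        return 0.0
    hits = sum(any(pest_mask(seq, window, min_fraction)) for seq in sequences)
    return 100.0 * hits / len(sequences)


# Made-up substrate sequences.
substrates = ["A" * 12 + "PEST" * 3, "MKLVAAGLLKAVGHW" * 2]
print(percent_in_pest(substrates, min_fraction=1.0))
print(percent_with_pest(substrates, min_fraction=1.0))
```

The two functions correspond to the two statistics in the text: the share of amino acids inside PEST regions, and the share of proteins that contain any PEST region at all.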



Solvent Accessibility and Structure Prediction
For caspase-3, 53.9% of all amino acids of the known substrates were predicted to be 'exposed', compared with 68.15% for the experimental cleavage sites. The predictions for the substrates of calpain-2 and granzyme-B also showed an increase in exposed amino acid residues, while the cathepsin-D results revealed a minimal decrease. This suggests that cleavage sites tend to lie in protein regions that are more readily accessible to solvent than the rest of the substrate. However, when solvent accessibility is used as an extra factor, caution is needed in the analysis of known substrates.

SitePrediction also has a secondary structure prediction feature that can help determine whether the presence or absence of secondary structure plays a role in the prediction of potential cleavage sites. The cleavage sites of caspase-3 and granzyme-B are located in unstructured sequences, while those of calpain-2 occur equally in structured and unstructured regions. Cathepsin-D cleavage sites, by contrast, are found in structured regions. All of these observations agree with the functions of these proteases. Secondary structure could therefore be a determining factor in predicting the likelihood of potential cleavage sites.