[Prediction (Basic ideas)]

About prediction of the 5-tile code sequence of amino acid sequences


Predictions of the 5-tile code sequence of amino acid sequences are made as follows.

(1) List of fragments and their 5-tile code sequences

First, the 5-tile codes of the ASTRAL SCOP 1.71 (95%) sequences [1] were computed to compile a list of all pairs of a fragment of length five and the 5-tile code sequence it assumes (“frag_code5_full.tbl”). Each occurrence of a fragment has its own entry in the list, so the same pair appears more than once if the fragment occurs more than once and some of its occurrences assume the same 5-tile code sequence.
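As a rough illustration of this step, the following Python sketch builds such a list from toy data. It assumes that a 5-tile code sequence can be written as a string with one character per amino acid; the function name `build_fragment_list`, the toy sequences, and the code letters are invented for the example and do not reflect the actual contents or format of the .tbl file.

```python
def build_fragment_list(entries, k=5):
    """Return every (fragment, code_fragment) pair, one entry per occurrence.

    `entries` is an iterable of (amino_acid_sequence, code_sequence) pairs.
    Assumption: both strings have the same length, one code per residue.
    """
    pairs = []
    for sequence, codes in entries:
        if len(sequence) != len(codes):
            raise ValueError("sequence and code string must have equal length")
        # Slide a window of length k over the sequence and its codes together.
        for start in range(len(sequence) - k + 1):
            pairs.append((sequence[start:start + k], codes[start:start + k]))
    return pairs


# Toy data invented for illustration (not real ASTRAL entries or 5-tile codes).
toy_entries = [
    ("ACDEFG", "aabbcc"),
    ("ACDEF", "aabbc"),
]
fragment_list = build_fragment_list(toy_entries)
```

Here the pair ("ACDEF", "aabbc") is stored twice, because the fragment occurs in both toy sequences with the same code sequence.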


(2) Frequency distributions of the 5-tile codes

Next, for each fragment of length five occurring in a given amino acid sequence, we search the compiled list for the same fragment. If an entry for the fragment exists, the 5-tile codes in its 5-tile code sequence are recorded at the corresponding positions of the fragment. In this way we obtain a frequency distribution of 5-tile codes for each amino acid of the sequence.
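The counting step might look like the sketch below. It is only an illustration under the same assumption as before (one code character per amino acid); `code_frequencies` and the toy list are hypothetical names and data, not part of the original programs.

```python
from collections import Counter


def code_frequencies(sequence, fragment_list, k=5):
    """Return one Counter of 5-tile codes per residue of `sequence`."""
    # Group the list entries by fragment so that lookups are fast.
    table = {}
    for fragment, codes in fragment_list:
        table.setdefault(fragment, []).append(codes)

    counts = [Counter() for _ in sequence]
    for start in range(len(sequence) - k + 1):
        fragment = sequence[start:start + k]
        # Every list entry of this fragment votes at each of its positions.
        for codes in table.get(fragment, []):
            for offset, code in enumerate(codes):
                counts[start + offset][code] += 1
    return counts


# Toy list invented for illustration.
toy_list = [
    ("ACDEF", "aabbc"),
    ("ACDEF", "aabbd"),
    ("CDEFG", "abbcc"),
]
counts = code_frequencies("ACDEFG", toy_list)
```

In this toy case the fifth residue (F) receives two votes for code "c" and one for code "d", since it is covered by both overlapping fragments.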


(3) Making a choice

If the most frequent 5-tile code at an amino acid is uniquely determined, that code is chosen as the prediction. If more than one code shares the highest frequency, the amino acid is marked as having multiple candidate 5-tile codes, and no guess is made.
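The selection rule can be sketched as follows. The marker characters "*" (multiple candidates) and "-" (no matching fragment) are assumptions made for this example; the text above does not say how such positions are displayed, nor how residues without any matching fragment are handled.

```python
from collections import Counter


def top_codes(counter):
    """Return the codes sharing the highest count (empty if nothing was counted)."""
    if not counter:
        return []
    best = max(counter.values())
    return sorted(code for code, n in counter.items() if n == best)


def predict(counts, multiple_mark="*", unmatched_mark="-"):
    """Turn per-residue counters into one symbol per residue."""
    symbols = []
    for counter in counts:
        candidates = top_codes(counter)
        if len(candidates) == 1:
            symbols.append(candidates[0])   # unique most frequent code
        elif candidates:
            symbols.append(multiple_mark)   # tie: mark it, do not guess
        else:
            symbols.append(unmatched_mark)  # assumption: nothing was counted
    return "".join(symbols)


# Toy frequency distributions invented for illustration.
toy_counts = [Counter(a=3), Counter(a=2, b=2), Counter(), Counter(c=2, d=1)]
prediction = predict(toy_counts)
```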


(4) About other lists

Finally, note that the 5-tile codes of a fragment of length five, except for the middle one, are not uniquely determined by the fragment, because they depend on the positions of the neighboring four amino acids, i.e., on the neighboring fragments. To improve prediction accuracy, we also consider two other lists of fragments. One is a list of all pairs of a fragment of length five and the 5-tile code of its middle amino acid (“frag_code5.tbl”). The other is a list of all pairs of a fragment of length seven and the 5-tile codes of its middle three amino acids (“frag_code7.tbl”).

When multiple lists are used, the frequencies of occurrence from each list are simply summed to create the total frequency distributions of the 5-tile codes. For example, if you use the two lists “frag_code5_full.tbl” and “frag_code5.tbl”, then the 5-tile code of the middle amino acid of each fragment is counted twice.
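A minimal sketch of this accumulation is given below. It assumes that each list entry pairs a fragment with a code string centered on the fragment (five codes for the full list, one code for the middle-only list, three codes for the length-seven list); the function name `accumulate` and the toy entries are hypothetical.

```python
from collections import Counter


def accumulate(sequence, lists):
    """Sum the code counts contributed by several fragment lists."""
    counts = [Counter() for _ in sequence]
    for pairs in lists:
        for fragment, codes in pairs:
            # Codes are assumed to be centered on the fragment, e.g. one code
            # of a length-5 fragment belongs to its middle residue (offset 2).
            offset = (len(fragment) - len(codes)) // 2
            start = sequence.find(fragment)
            while start != -1:
                for i, code in enumerate(codes):
                    counts[start + offset + i][code] += 1
                # Continue from the next position to catch overlapping hits.
                start = sequence.find(fragment, start + 1)
    return counts


# Toy lists invented for illustration.
full_list = [("ACDEF", "aabbc")]   # all five codes of the fragment
middle_list = [("ACDEF", "b")]     # code of the middle residue only
total = accumulate("ACDEF", [full_list, middle_list])
```

With these two toy lists, the middle residue (D) gets two votes for code "b", one from each list, while the other residues get a single vote each.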


[References]
  1. Chandonia JM, Hon G, Walker NS, Lo Conte L, Koehl P, Levitt M, Brenner SE. The ASTRAL compendium in 2004. Nucleic Acids Research 32:D189-D192 (2004).