NAME
ComSubstruct (ver. 0.5) -
compute the longest common subsequence (LCS) of 5-tile code between two
proteins. See also EXAMPLES (Prediction / Alignment / Others).
About
ver. 0.2a :
(1) support subsequence comparison (option "-s" and "-w").
Example:
subsequence
comparison,
Example:
whole
sequence comparison.
(2) support symbol
'*' used in ".code" files
created by NtileCodePredictor.
About
ver. 0.3a : One could specify the
number of best matches reported in the ".lcs" file, using the "-b"
option. By default, five
matches
will be reported. (
example)
About
ver. 0.4 :
One could specify the sort policy, using the "-p" option. By default,
whole structure of the key protein is considerd when best results are
selected (
example).
If "-p1" is specified, domain of the key protein
is considered (
example,
see the <Best4>
section).
About
ver. 0.5 :
Print out the frequency destribution of |LCS|/|target sequence| if
"-p1"
is specified.
(
Example ("-p0") and
example ("-p1"))
2008-04-19
:
The ".code" file of ASTRAL dataset (
1.73
95%) is
available from the download section ("target_ASTRAL173.code").
2010-04-20
:
The ".code" file of ASTRAL dataset (1.75
95%)
is
available from the download section ("target_ASTRAL175.code").
DOWNLOADS (Free for
academic use)
SYNOPSIS
(1) ComSubstruct
[-o
option_number][-s][-w
window_width][-p
sort_policy]
[-b
num_of_best_matches]
key_protein.code
target_proteins.code
(2) ComSubstruct -l
[-v] [-o
option_number][-s][-w
window_width][-p
sort_policy]
[-b
num_of_best_matches]
key_protein.code target_proteins.code
(3) ComSubstruct -h
Examples:
% ComSubstruct
key_protein.code target_proteins.code
(Compute the LCS of "key protein" and each of "target proteins," using
the naive Dynamic Programming algorithm. The output file
contains all
the LCSs, the frequency table of LCS lengths, and the LCS
with
the key protein and one of the most similar target proteins.)
% ComSubstruct
-s
key_protein.code target_proteins.code
(Compute
the LCS of "key protein" and each of "target proteins," using the naive
Dynamic Programming algorithm. Upon comparison, the shorter sequence is
compared with subsequences of the longer sequence. One could specify
the length of subsequences (width of compare window) by the
"-w" option.)
% ComSubstruct -l key_protein.code
target_proteins.code
(Compute the length of the LCS of "key protein" and each of "target
proteins." The output file contains the frequency table of LCS lengths
and the LCS between the key protein and one of the most similar target
proteins.)
% ComSubstruct -l -v -o1
key_protein.code target_proteins.code
(Compute the length of the LCS of "key protein" and each of "target
proteins," using Hirschberg's algorithm. The output file
contains all
the LCS lengths , the frequency table of the LCS lengths
and the
LCS between the key protein and one of the most similar target
proteins)
INPUT
(1)
"key_protein.code" file : A ".code" file which specifies the key
protein.
example
The file should contain at least a pair of [Res.] and [Code] entries.
When there are more than one protein (a pair of [Res.] and [Code])
in the file, the first protein only is considered. The other pairs are
omitted.
(2)
"target_proteins.code" file : A ".code" file which specifies the target
proteins.
example
The file should contain at least a pair of [Res.] and [Code] entries.
When there are more than one protein (a pair of [Res.] and [Code])
in the file, LCSs between the key protein and each of the target
proteins are computed.
[TIPS 1]
About PDB files
Use
ProteinEncoder
to create ".code" files from PDB files.
[TIPS 2]
About FASTA files
Use
NtileCodePredictor
to create ".code" files from FASTA files.
OUTPUT
(1)
"key_protein_target_protein.lcs" file : A ".lcs" file which contains
the frequency
table of LCS lengths and the LCS between the key protein
and some of the most similar target proteins.
Example1 (default), Example2 ("-l"),
Example3 ("-l" & "-v")
The
LCS is aligned with the key and the target proteins respectively.
One could specify
the number of the best LCSs
reported in the ".lcs" file, using the "-b" option.
It also contains all the LCSs
between the key protein and each of the target proteins if the "-l"
option is NOT specified. When the "-l" option is specified, all the LCS
lengths are given only if the "-v" option is also specified.
PERFORMANCE
The three most challenging
targets at the CASP7 is compared with all the
13005 sequences in
the
ASTRAL
dataset (
1.71 95%).
The program was run on a notebook compuer with
2 GHz Intel Core 2 Duo
and
1GB 667 MHz DDR2
SDRAM:
|
|
Execution
Time (sec.) |
"-o0" |
"-o1" |
"-o2" |
"-o4" |
|
|
* The CASP7 Home page:
http://predictioncenter.org/casp7/.
* The ASTRAL homepage
http://astral.berkeley.edu.
DESCRIPTION
ComSubstruct reads two
".code" files, compute the LCS between the key protein and each of the
target proteins, and outputs a ".lcs" file.
By default, ComSubstruct computes LCSs, using the naive Dynamic
Programing algorithm [1].
The following options are available:
-s
By default, ComSubstruct
compares whole sequences. If "-s" is specified, ComSubstruct compares
the shorter sequence with subsequences of the longer sequence.
-w window_width
(float)
By default, the width
of compare window (the length of subsequences of the longer
sequence)
is the product of "1.2" and the length of the shorter sequence. If "-w"
is specified, the width is set to the product of "window_width" and the
length of the shorter sequence. Note that "window_width" is the ratio
of the width of compare window to the length of the shorter
sequence:
the width of the compare window = "window_width times the
length of the shorter sequence" residues,
the width of slide step = one residue
if slide_width <=
0,
slide_width residues
if
slide_width >
0,
where
slide_width = (int) "(window_width - 1.0) times the
length of the shorter sequence."
In particular,
the compare window is shifted by one residue if
"window_width" is set to 1.0.
-l
By default, ComSubstruct
computes LCSs. If "-l" is specified, ComSubstruct computes the length
of LCSs only and it will finish
the job more rapidly. Useful when "target_protein.code" file contains
many proteins. (Recall that the ".lcs" file contains
the frequency table of LCS lengths and the LCS between the key protein
and one of the most similar target proteins.)
-v
By default, the length of
LSCs between the key protein and each of the target proteins are NOT
reported expect the length the LCS between the key protein and one of
the most similar target proteins when "-l" is specified (nonverbal
mode). If "-v" is specified, the length of each LSC is reported in the
".lcs" file
(verbal mode).
-p sort_policy
By
default, whole structure of the key protein is considerd when best
results are selected. If "-p sort_policy" (sort_policy = 0 or 1) is
specified, another policy is used:
(1) Set the sort_policy to 0 for whole structure
search of
the key protein (default). Then best results are selected based on the length of LCS
with the length of target proteins for tie-break (the shorter,
the
better).
(2) Set the sort_policy to 1 for domain search of
the key
protein. Then best results are selected based on the ratio of (the length of LCS)/(the
length of target protein) with the length of target
proteins for tie-break (the longer, the
better).
-b
num_of_best_matches
>By default,
ComSubstruct
reports best five matches in the ".lcs" file. If "-b num"
is specified, best "num" matches will be reported.
-o option_number
By default, ComSubstruct
computes LCSs, using the naive Dynamic Programming algorithm [1]. If
"-o option_number" (option number
= 0, 1, ..., or 4) is specified,
another algorithm is used:
(1) Set the option_number to 0
for Naive
Dynamic Programming (default),
(2) Set the option_number to 1
for Hirschberg's
algorithm [2],
(3) Set the option_number to 2
for Hirschberg's
+ Hunt and Szymanski's
algorithm [3],
(4) Set the option_number to 3
for Wilbur
and Lipman type heuristic
algorithm [4] (NOT SUPPORTED),
(5) Set the option_number to 4
for Kuo,
Chang, and Kuo's heuristic
algorithm [5].
-v
If the "-h" option is
specified, synopsis is shown.
REFERENCES
[1]
N. C. Jones and P. A. Pevzner. An Introduction to Bioinformatics
Algorithm. The MIT Press, London, 2004.
[2] D. S.
Hirshberg. A linear space algorithm for computing maximal common
subsequences, Commun. ACM 18 (1975), no. 6, 341-343.
[3] J. W.
Hunt and T. G. Szymanski, A fast algorithm for computing longest common
subsequences, Commun. ACM 20 (1977), no. 5, 350-353.
[4] W. J.
Wilbur and D. J. Lipman. Rapid similarity searches of nucleic
acid and protein data banks. Natl. Acad. Sci. USA 80 (1983), 726-730.
[5] K.-S.
Huang, C.-B. Yang, and K.-T. Tseng. Fast algorithms for finding the
common subsequence of multiple sequences. Int. Computer Symposium, Dec.
15-17, 2004, taipei, Taiwan.