Com Substruct

ComSubstruct (ver. 0.5) - compute the longest common subsequence (LCS) of 5-tile code between two proteins. See also EXAMPLES (Prediction / Alignment / Others).

About ver. 0.2a : (1) support subsequence comparison (option "-s" and "-w").
Example: subsequence comparison,
Example: whole sequence comparison.
(2) support symbol '*' used in ".code" files created by NtileCodePredictor.

About ver. 0.3a : One could specify the number of best matches reported in the ".lcs" file, using the "-b" option. By default, five matches will be reported. (example)

About ver. 0.4 : One could specify the sort policy, using the "-p" option. By default, whole structure of the key protein is considerd when best results are selected (example). If "-p1" is specified, domain of the key protein is considered (example, see the <Best4> section).

About ver. 0.5 : Print out the frequency destribution of |LCS|/|target sequence| if "-p1" is specified.
(Example ("-p0") and example ("-p1"))

2008-04-19 : The ".code" file of ASTRAL dataset (1.73 95%) is available from the download section ("target_ASTRAL173.code").

2010-04-20 : The ".code" file of ASTRAL dataset (1.75 95%) is available from the download section ("target_ASTRAL175.code").

(1) ComSubstruct [-o option_number][-s][-w window_width][-p sort_policy]
[-b num_of_best_matches] key_protein.code target_proteins.code
(2) ComSubstruct -l [-v] [-o option_number][-s][-w window_width][-p sort_policy]
[-b num_of_best_matches] key_protein.code target_proteins.code
(3) ComSubstruct -h

Examples:

% ComSubstruct key_protein.code target_proteins.code
(Compute the LCS of "key protein" and each of "target proteins," using the naive Dynamic Programming algorithm. The output file contains all the LCSs, the frequency table of LCS lengths, and the LCS with the key protein and one of the most similar target proteins.)

% ComSubstruct -s key_protein.code target_proteins.code
(Compute the LCS of "key protein" and each of "target proteins," using the naive Dynamic Programming algorithm. Upon comparison, the shorter sequence is compared with subsequences of the longer sequence. One could specify the length of subsequences (width of compare window) by the "-w" option.)

% ComSubstruct -l key_protein.code target_proteins.code
(Compute the length of the LCS of "key protein" and each of "target proteins." The output file contains the frequency table of LCS lengths and the LCS between the key protein and one of the most similar target proteins.)

% ComSubstruct -l -v -o1 key_protein.code target_proteins.code
(Compute the length of the LCS of "key protein" and each of "target proteins," using Hirschberg's algorithm. The output file contains all the LCS lengths , the frequency table of the LCS lengths and the LCS between the key protein and one of the most similar target proteins)

(1) "key_protein.code" file : A ".code" file which specifies the key protein. example

The file should contain at least a pair of [Res.] and [Code] entries.
When there are more than one protein (a pair of [Res.] and [Code]) in the file, the first protein only is considered. The other pairs are omitted.

(2) "target_proteins.code" file : A ".code" file which specifies the target proteins. example

The file should contain at least a pair of [Res.] and [Code] entries.
When there are more than one protein (a pair of [Res.] and [Code]) in the file, LCSs between the key protein and each of the target proteins are computed.

[TIPS 1] About PDB files
Use ProteinEncoder to create ".code" files from PDB files.

[TIPS 2] About FASTA files
Use NtileCodePredictor to create ".code" files from FASTA files.

The three most challenging targets at the CASP7 is compared with all the 13005 sequences in the ASTRAL dataset (1.71 95%).
The program was run on a notebook compuer with 2 GHz Intel Core 2 Duo and 1GB 667 MHz DDR2 SDRAM:

Input
key	target

-l / -s / -w
options

Execution Time (sec.)
"-o0"	"-o1"	"-o2"	"-o4"

Output

2HA9
(length 397)

ASTRAL

26.4

1321.1

1970.2

1703.0

".lcs" file

2HA9
(length 397)

ASTRAL

"-l"

15.1

8.2

15.8

".lcs" file

2HA9
(length 397)

ASTRAL

"-l -s"

40.4

21.5

35.7

".lcs" file

2HA9
(length 397)

ASTRAL

"-l -s -w1.1"

75.4

40.4

65.3

".lcs" file

2HG6
(length 106)

ASTRAL

8.1

354.8

406.6

452.3

".lcs" file

2HG6
(length 106)

ASTRAL

"-l"

4.2

2.4

3.1

".lcs" file

2HG6
(length 106)

ASTRAL

"-l -s"

12.0

6.3

8.9

".lcs" file

2HG6
(length 106)

ASTRAL

"-l -s -w1.1"

22.2

11.2

16.3

".lcs" file

2IDB
(length 463)

ASTRAL

30.6

1566.4

2412.5

2028.5

".lcs" file

2IDB
(length 463)

ASTRAL

"-l"

17.2

9.7

19.8

".lcs" file

2IDB
(length 463)

ASTRAL

"-l -s"

53.3

28.4

51.4

".lcs" file

2IDB
(length 463)

ASTRAL

"-l -s -w1.1"

101.5

53.5

95.1

".lcs" file

* The CASP7 Home page: http://predictioncenter.org/casp7/.
* The ASTRAL homepage http://astral.berkeley.edu.

ComSubstruct reads two ".code" files, compute the LCS between the key protein and each of the target proteins, and outputs a ".lcs" file. By default, ComSubstruct computes LCSs, using the naive Dynamic Programing algorithm [1].

The following options are available:

-s

By default, ComSubstruct compares whole sequences. If "-s" is specified, ComSubstruct compares the shorter sequence with subsequences of the longer sequence.

-w window_width (float)

By default, the width of compare window (the length of subsequences of the longer sequence) is the product of "1.2" and the length of the shorter sequence. If "-w" is specified, the width is set to the product of "window_width" and the length of the shorter sequence. Note that "window_width" is the ratio of the width of compare window to the length of the shorter sequence:

the width of the compare window = "window_width times the length of the shorter sequence" residues,
the width of slide step = one residue if slide_width <= 0,
slide_width residues if slide_width > 0,

where slide_width = (int) "(window_width - 1.0) times the length of the shorter sequence."
In particular, the compare window is shifted by one residue if "window_width" is set to 1.0.

-l

By default, ComSubstruct computes LCSs. If "-l" is specified, ComSubstruct computes the length of LCSs only and it will finish the job more rapidly. Useful when "target_protein.code" file contains many proteins. (Recall that the ".lcs" file contains the frequency table of LCS lengths and the LCS between the key protein and one of the most similar target proteins.)

-v

By default, the length of LSCs between the key protein and each of the target proteins are NOT reported expect the length the LCS between the key protein and one of the most similar target proteins when "-l" is specified (nonverbal mode). If "-v" is specified, the length of each LSC is reported in the ".lcs" file (verbal mode).

-p sort_policy

By default, whole structure of the key protein is considerd when best results are selected. If "-p sort_policy" (sort_policy = 0 or 1) is specified, another policy is used:

(1) Set the sort_policy to 0 for whole structure search of the key protein (default). Then best results are selected based on the length of LCS with the length of target proteins for tie-break (the shorter, the better).
(2) Set the sort_policy to 1 for domain search of the key protein. Then best results are selected based on the ratio of (the length of LCS)/(the length of target protein) with the length of target proteins for tie-break (the longer, the better).

-b num_of_best_matches

>By default, ComSubstruct reports best five matches in the ".lcs" file. If "-b num" is specified, best "num" matches will be reported.

-o option_number

By default, ComSubstruct computes LCSs, using the naive Dynamic Programming algorithm [1]. If "-o option_number" (option number = 0, 1, ..., or 4) is specified, another algorithm is used:

(1) Set the option_number to 0 for Naive Dynamic Programming (default),
(2) Set the option_number to 1 for Hirschberg's algorithm [2],
(3) Set the option_number to 2 for Hirschberg's + Hunt and Szymanski's algorithm [3],
(4) Set the option_number to 3 for Wilbur and Lipman type heuristic algorithm [4] (NOT SUPPORTED),
(5) Set the option_number to 4 for Kuo, Chang, and Kuo's heuristic algorithm [5].

-v

If the "-h" option is specified, synopsis is shown.

[1] N. C. Jones and P. A. Pevzner. An Introduction to Bioinformatics Algorithm. The MIT Press, London, 2004.
[2] D. S. Hirshberg. A linear space algorithm for computing maximal common subsequences, Commun. ACM 18 (1975), no. 6, 341-343.
[3] J. W. Hunt and T. G. Szymanski, A fast algorithm for computing longest common subsequences, Commun. ACM 20 (1977), no. 5, 350-353.
[4] W. J. Wilbur and D. J. Lipman. Rapid similarity searches of nucleic acid and protein data banks. Natl. Acad. Sci. USA 80 (1983), 726-730.
[5] K.-S. Huang, C.-B. Yang, and K.-T. Tseng. Fast algorithms for finding the common subsequence of multiple sequences. Int. Computer Symposium, Dec. 15-17, 2004, taipei, Taiwan.