Content and Format of Files
This section describes the format of the input files for MASIA, all of which must be correctly formatted, and of the output files. If an input or output file is indicated in brackets, the extension of the output file name is generated by MASIA automatically.
The character "#" in the first column indicates that the whole line is a comment and will not be read by the program. Each of the 20 amino acids is characterized according to a particular property as a member of a numbered group.
# properties of amino acids
#
#                   G A V L I P S T D E N Q K R H F Y W M C
glycine          2  1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 G
secondCF         4  3 1 2 1 2 3 3 2 3 1 3 4 1 4 4 4 2 2 1 4 HBT
solvebtKD        3  3 1 1 1 1 3 3 3 2 2 2 2 2 2 2 1 3 3 1 1 hi
hphobic          2  1 1 1 1 1 1 2 2 2 2 2 2 2 2 1 1 1 1 1 1 h
aromatic         2  2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 2 2 a
size             3  1 1 3 3 3 3 1 3 3 3 3 3 2 2 3 2 2 2 3 3 sl
inout            3  3 3 1 1 1 2 2 2 2 2 2 2 2 2 3 1 3 1 1 1 io
polarity         4  1 1 1 1 1 1 2 2 4 4 2 2 3 3 2 1 2 2 1 2 np
For example, the property "hphobic" (hydrophobic) separates the amino acids into group 1, defined as hydrophobic (h) = {G, A, V, L, I, P, H, F, Y, W, M, C}, and group 2, non-hydrophobic. The property names in the default list are described in the Appendix. A new property added to the list must follow the FORTRAN format (A15,1X,I2,1X,20I2,1X,A3).
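The fixed-width layout given by the FORTRAN format can be read with plain column slicing. The following Python sketch assumes, judging from the example records, that the fields are the property name (A15), the number of groups (I2), the group number of each amino acid in the order G A V L I P S T D E N Q K R H F Y W M C (20I2), and the group symbols (A3). The function names are illustrative and are not part of MASIA.

```python
AMINO_ACIDS = "GAVLIPSTDENQKRHFYWMC"  # column order used in the property file

def format_property(name, n_groups, groups, symbols):
    """Write one record laid out as (A15,1X,I2,1X,20I2,1X,A3)."""
    assert len(groups) == 20
    return (f"{name:<15.15} {n_groups:2d} "
            + "".join(f"{g:2d}" for g in groups)
            + f" {symbols:<3.3}")

def parse_property(line):
    """Read one record; returns None for comment lines starting with '#'."""
    if line.startswith("#"):
        return None
    line = line.rstrip("\n").ljust(63)
    name = line[0:15].strip()            # A15
    n_groups = int(line[16:18])          # I2 (assumed: number of groups)
    groups = [int(line[19 + 2 * i: 21 + 2 * i]) for i in range(20)]  # 20I2
    symbols = line[60:63].strip()        # A3 (assumed: one symbol per named group)
    return name, n_groups, dict(zip(AMINO_ACIDS, groups)), symbols

record = format_property("hphobic", 2, [1] * 6 + [2] * 8 + [1] * 6, "h")
name, n_groups, groups, symbols = parse_property(record)
hydrophobic = [aa for aa in AMINO_ACIDS if groups[aa] == 1]
```

Writing a record with `format_property` and reading it back is a quick way to check that a new property line has every field in the expected columns before it is added to the list.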
The probability that one amino acid will substitute for another in families of structurally related proteins is based on the results of Risler et al. (1988) J. Mol. Biol. 204, 1019-1029. The numbers give the relative frequency, in percent, with which a given amino acid is substituted by each of the remaining 19 amino acids. The numbers in each row sum to 100%.
# the substitution matrix
       G     A     V     L     I     P     S     T     D     E     N     Q ... ...
G   0.00 20.18  4.39  1.75  0.88  1.75 17.54  9.65  5.26  6.14 11.40  2.63
A  10.60  0.00 13.36  4.61  6.45  5.53 13.36 11.52  2.76  5.07  4.15  4.61
V   1.92 11.15  0.00 17.69 26.15  1.54  8.08  5.00  1.15  3.08  2.31  3.08
L   1.11  5.56 25.56  0.00 21.11  0.56  1.67  5.00  0.56  0.56  3.89  1.11
I   0.54  7.57 36.76 20.54  0.00  1.62  2.16  8.11  2.16  3.24  1.08  0.54
P   3.45 20.69  6.90  1.72  5.17  0.00 15.52  8.62  1.72 17.24  0.00  1.72
S   9.05 13.12  9.50  1.36  1.81  4.07  0.00 19.46  6.33  2.71 10.41  2.26
... ...
The complete exchange matrix is included in the Appendix. In structurally related proteins, Ala is replaced by Gly in 10.60% of the cases, by Val in 13.36%, etc. As expected, amino acids with similar structures and properties have higher exchange probabilities, e.g., Leu~Val: 25.56%, Leu~Ile: 21.11%, Ile~Val: 36.76%, Ile~Leu: 20.54%, while those with dissimilar structures have lower exchange probabilities, e.g., Leu~Asp: 0.56%, Leu~Glu: 0.56%.
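The lookups quoted above can be reproduced with a small reader for the matrix rows. This is a minimal Python sketch that assumes whitespace-separated values and uses only the Leu and Ile rows and the first 12 columns of the excerpt; `parse_matrix` is an illustrative name, not a MASIA routine.

```python
COLUMNS = "G A V L I P S T D E N Q".split()  # first 12 columns of the excerpt

EXCERPT = """\
# the substitution matrix (two rows of the excerpt)
L 1.11 5.56 25.56 0.00 21.11 0.56 1.67 5.00 0.56 0.56 3.89 1.11
I 0.54 7.57 36.76 20.54 0.00 1.62 2.16 8.11 2.16 3.24 1.08 0.54
"""

def parse_matrix(text, columns):
    """Map each row residue to {column residue: percent}; '#' lines are skipped."""
    matrix = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields or line.startswith("#"):
            continue
        matrix[fields[0]] = dict(zip(columns, map(float, fields[1:])))
    return matrix

matrix = parse_matrix(EXCERPT, COLUMNS)
print(matrix["L"]["V"], matrix["L"]["D"])  # 25.56 0.56
```

The matrix is not symmetric: the Leu row gives 21.11 for Ile, whereas the Ile row gives 20.54 for Leu, so the row residue must always be the one being replaced. With the full 20-column matrix, each row should add up to about 100.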
The first line lists the first sequence from the multiple alignment file (*.msf or *.aln, see below). The following lines contain, for each property, the symbols of the conserved property at those positions that fulfil the criterion for significance. If the criterion is not fulfilled, a blank appears at the corresponding column position. In the following example, the positions where the properties turn4 (turn-forming propensity) and alpha4 (α-helix-forming propensity) are significantly conserved are marked with T and H, respectively.
# the substitution matrix
sequence GFPIPDPYCWDISFRTFYTIVDDEHKTLFNGILL.LSQADNADHLNELRRCTGKHFLNEQ
------------------------------------------------------------------------
alpha4 HHH H HH H HH H H H H HH HH .H H HH H H HH
turn4 T T T TT T T TTT T T TT . T TT TT T T
inout ioiooo ioooio i ooiooo ooiio ii .i ioo oi iioio i oo
The following example shows the consensus lists (conserved symbols) at different conservation levels (40%, 50%, 60%, 70%, 80%, 90%, and 100%), obtained from the analysis of the multiply aligned sequences.
sequence GFPIPDPYCWDISFRTFYTIVDDEHKTLFNGILL.LSQADNADHLNELRRCTGKHFLNEQ
------------------------------------------------------------------------
consensus GFPIPEPY WDESFRTFY LDDEHKTLFNGIFA.L NNADNL L VT HFLDEE
consensus GF IPEPY WDESFRTFY LD EHKTLFNGIF .L NNADNL L VT HF EE
consensus GF IP PY WD SF TFY D EHK LFNGIF .L A NL L VT HF EE
consensus GF IP PY WD SF FY D EHK LFNGIF .L A L T HF E
consensus P P WD SF FY D EHK F F . A L HF E
consensus P P W SF Y D EH F . L HF E
consensus P P W F Y D H F . HF
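How a consensus list thins out as the conservation level rises can be illustrated with a short Python sketch. The exact criterion used by MASIA is not stated here, so the rule below is an assumption: a column keeps its most frequent residue only if that residue occurs in at least the given percentage of the sequences. The alignment `toy` is hypothetical data, not taken from the example above.

```python
from collections import Counter

def consensus(alignment, percent):
    """Return one consensus string for equally long aligned sequences.

    A column keeps its most frequent residue if that residue occurs in at
    least `percent` percent of the sequences; otherwise a blank is written.
    Gap characters are counted like any other symbol (an assumption).
    """
    n = len(alignment)
    out = []
    for column in zip(*alignment):
        residue, count = Counter(column).most_common(1)[0]
        out.append(residue if count * 100 >= percent * n else " ")
    return "".join(out)

toy = ["GFPIP", "GFAIP", "GYPLP", "AFPIP", "GFPVP"]  # hypothetical alignment
for percent in (40, 60, 80, 100):
    print(f"{percent:3d}% |{consensus(toy, percent)}|")
```

As in the lists above, positions that are conserved at a high level stay filled in every line, while weakly conserved positions drop out first.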
The MASIA file (*.mas) contains the results of the evaluated rules and a summary of the predictions. The first and last lines show the first sequence from the multiple alignment file. The results for each rule are displayed on two lines. The first line contains the prediction for each individual residue position, made from the conserved properties (as in the *.res file) according to the specific rule. The second line gives the result of the "x of y" criterion, i.e., whether a property is conserved in at least x symbols out of y positions. In the following example, the rule for alph requires 3 out of 5 conserved positions, the rule for turn 4 out of 4 positions, and the rules for inside and outside 1 out of 1 position each. Below the dashed line, a summary of the inside/outside and secondary structure predictions is listed. The line before the last gives the secondary structure prediction (SSP), which is based on a hierarchical analysis of the predicted secondary structure. The current version of MASIA simply uses the priority turn >> α > β.
sequence GFPIPDPYCWDISFRTFYTIVDDEHKTLFNGILL.LSQADNADHLNELRRCTGKHFLNEQ
alph HHH H HH H HH H H H H HH HH H H HH H H HH
HHHHHHHHHHHHHHHHHH HHHHHHHHHHHHH HHHHHHHH HHHHH
turn T T T TT T T TTT T T TT T TT TT T T
insd i i i i i i ii ii i i i ii i i
i i i i i i ii ii i i i ii i i
outs o ooo ooo o oo ooo oo o oo o o o ooo
o ooo ooo o oo ooo oo o oo o o o ooo
---------------------------------------------------------------------
alph HHHHHHHHHHHHHHHHHH HHHHHHHHHHHHH HHHHHHHH HHHHH
turn
insd i i i i i i ii ii i i i ii i i
outs o ooo ooo o oo ooo oo o oo o o o ooo
SSP HHHHHHHHHHHHHHHHHH HHHHHHHHHHHHH HHHHHHHH HHHHH
sequence GFPIPDPYCWDISFRTFYTIVDDEHKTLFNGILL.LSQADNADHLNELRRCTGKHFLNEQ
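The effect of an "x of y" rule on a per-residue row can be sketched in Python as follows. The windowing details are an assumption, since only the criterion itself is described: every window of y consecutive positions that contains at least x occurrences of the symbol is filled completely with that symbol. The function name `x_of_y` and the short input rows are hypothetical.

```python
def x_of_y(row, symbol, x, y):
    """Apply an "x of y" window rule to one per-residue prediction row.

    Every window of `y` consecutive positions that contains at least `x`
    occurrences of `symbol` is filled completely with `symbol` (assumed
    behaviour); all other positions are left blank.
    """
    out = [" "] * len(row)
    for start in range(len(row) - y + 1):
        if row[start:start + y].count(symbol) >= x:
            out[start:start + y] = symbol * y
    return "".join(out)

helix = x_of_y("HHH  H  H", "H", 3, 5)   # scattered H symbols are joined
inside = x_of_y("i  i", "i", 1, 1)       # 1 of 1 leaves the row unchanged
turn = x_of_y("T T TT", "T", 4, 4)       # no window holds four T symbols
```

Under this reading, a "1 of 1" rule reproduces the first line exactly, and a strict "4 of 4" rule removes isolated turn symbols, which matches the empty turn summary in the example.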
The following example contains three different predictions of α-helix-forming propensity, made from three different rules. Because all three rules are assigned to the same characteristic, alph, they are grouped together to "average" the α-helix-forming propensity over the three rules, and the result is shown below the dashed line. For beta and turn, only one rule each is assigned to the corresponding characteristic, so the predictions below the dashed line are the same as those above it.
sequence GFPIPDPYCWDISFRTFYTIVDDEHKTLFNGILL.LSQADNADHLNELRRCTGKHFLNEQ
beta BBBBB BBBBBBBBBBB BBBBBBBBBBBBBB BBBBBBBBBBBBBBBB
BBBBB BBBBBBBBBBB BBBBBBBBBBBBBB BBBBBBBBBBBBBBBB
alph HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
alph HHHHHHHHHHHHHHHHHHHHHHHHHHH HHHHHHHHHHHHHHH
HHHHHHHHHHHHHHHHHHHHHHHHHHH HHHHHHHHHHHHHHH
alph HHHHHHHHHHHHHHHHHHHHH HHHHHHHHHHHHHHHHHHHHHH
HHHHHHHHHHHHHHHHHHHHH HHHHHHHHHHHHHHHHHHHHHH
turn TTTTTT TTTT TTTTTTTT
TTTTTT TTTT TTTTTTTT
---------------------------------------------------------------------
beta BBBBB BBBBBBBBBBB BBBBBBBBBBBBBB BBBBBBBBBBBBBBBB
alph HHHHHHHHHHHHHHHHHHHHHHHHHHHH HHHHHHHHHHHHHHHHHHHHHH
turn TTTTTT TTTT TTTTTTTT
SSP TTTTTTHHHHHTTTTHHHHHHHHHHH TTTTTTTTHHHHHHHHHHHHHHHH
sequence GFPIPDPYCWDISFRTFYTIVDDEHKTLFNGILL.LSQADNADHLNELRRCTGKHFLNEQ
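The SSP line can be thought of as a position-by-position merge of the summary rows. The Python sketch below takes the priority literally (turn first, then alpha, then beta); this is an assumed reading, and the hierarchical analysis actually performed by MASIA may differ in detail. The rows in `rows` are hypothetical.

```python
PRIORITY = [("turn", "T"), ("alph", "H"), ("beta", "B")]  # turn, then alpha, then beta

def merge_ssp(rows):
    """Combine per-characteristic rows into one SSP row, position by position."""
    length = max((len(r) for r in rows.values()), default=0)
    ssp = []
    for i in range(length):
        for name, symbol in PRIORITY:
            row = rows.get(name, "")
            if i < len(row) and row[i] == symbol:
                ssp.append(symbol)   # highest-priority prediction wins
                break
        else:
            ssp.append(" ")          # nothing predicted at this position
    return "".join(ssp)

rows = {                 # hypothetical summary rows of equal length
    "beta": "BBBB    ",
    "alph": "  HHHH  ",
    "turn": "   TT   ",
}
ssp = merge_ssp(rows)
```

Where predictions overlap, a turn overrides a helix and a helix overrides a strand, so the beta row only shows through at positions that neither of the other two rows claims.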
The topology file lists the residue ranges of the secondary structure segments and the inside/outside residues. All of this structural information is taken from the prediction results. In the topology file *.top, a line with the character "#" in the first column is treated as a comment and is skipped by the program. The format for segments of secondary structure is (A4,1X,A2,I4,I4).
# topology file
alph H1   1  12
alph H2  41  46
beta B1  13  21
beta B2  25  35
beta B3  47  55
insd 9 2 4 10 14 17 21 28 29 32
insd 8 33 35 38 44 47 48 50 55
outs 9 3 5 6 7 11 12 13 15 19
outs 9 20 22 23 24 26 27 30 39 40
outs 4 43 49 51 57
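A topology file of this kind can be read with a few lines of Python. The sketch below slices segment records according to (A4,1X,A2,I4,I4). For the insd and outs records it assumes, judging from the example, that the first number is the count of residue numbers that follow on the same line; the fixed-width layout of those records is not specified, so they are split on whitespace. `parse_topology` and `EXAMPLE` are illustrative names.

```python
EXAMPLE = """\
# topology file
alph H1   1  12
beta B1  13  21
insd  9   2   4  10  14  17  21  28  29  32
outs  4  43  49  51  57
"""

def parse_topology(text):
    """Return (segments, residues) from the text of a topology file.

    segments: list of (characteristic, label, first, last)
    residues: dict mapping 'insd'/'outs' to a list of residue numbers
    """
    segments, residues = [], {}
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue                      # blank line or comment
        key = line[0:4]
        if key in ("insd", "outs"):
            count, *numbers = map(int, line[4:].split())
            if count != len(numbers):     # assumed meaning of the first number
                raise ValueError(f"expected {count} residues: {line!r}")
            residues.setdefault(key, []).extend(numbers)
        else:
            # (A4,1X,A2,I4,I4): name, blank, label, first and last residue
            segments.append((key, line[5:7], int(line[7:11]), int(line[11:15])))
    return segments, residues

segments, residues = parse_topology(EXAMPLE)
```

Checking the leading count against the number of residues on each line catches truncated or hand-edited records early.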
In the current version of MASIA, the *.top format can be read directly by the program TRANSLATE, which generates distance constraints and dihedral angle constraints for self-correcting distance geometry calculations with the program DIAMOD.
The *.sec file has the same format as the topology file. It is needed as an input file for MASIA whenever the command outcpr is used. The structural information, i.e., the secondary structure and the inside/outside lists, is taken from an experimental X-ray or NMR structure. The secondary structure is produced by a hydrogen-bonding recognition algorithm (Kabsch W and Sander C, Biopolymers, 1983, 22, 2577-2637), and the inside/outside list is determined from the accessible surface area of the side-chain atoms. Residues are considered "inside" if their solvent-accessible surface area in the tertiary structure is less than 20% of a "random coil" value, and "outside" if it is more than 50% of this reference value. The "random coil" value of a residue X is the average solvent-accessible surface area of X in the tripeptide Gly-X-Gly in an ensemble of 30 random conformations. Accessible surface areas can be obtained with the program ANAREA (Richmond TJ, 1984, J. Mol. Biol. 178, 63-89).
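The inside/outside thresholds amount to a simple classification on the relative accessible surface area. The Python sketch below is illustrative only: the function name and the numeric areas are hypothetical, and real "random coil" reference values would have to come from the Gly-X-Gly ensembles described above.

```python
def classify_burial(area, random_coil_area):
    """Classify one residue from its side-chain accessible surface area.

    Both areas must be given in the same unit.
    Returns 'i' (inside), 'o' (outside) or ' ' (neither).
    """
    relative = area / random_coil_area
    if relative < 0.20:      # less than 20% of the random-coil value
        return "i"
    if relative > 0.50:      # more than 50% of the random-coil value
        return "o"
    return " "               # intermediate exposure: not listed

labels = [classify_burial(a, 100.0) for a in (10.0, 30.0, 60.0)]  # hypothetical areas
```

Residues between 20% and 50% of the reference value fall into neither list, which is why the inside and outside rows together do not cover every position of the sequence.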
In addition to the results contained in the MASIA file *.mas, the *.cpr file compares the predicted results with the experimental data (as specified in *.sec) below the dashed line.
sequence GFPIPDPYCWDISFRTFYTIVDDEHKTLFNGILL.LSQADNADHLNELRRCTGKHFLNEQ
alph HHH H HH H HH H H H H HH HH H H HH H H HH
HHHHHHHHHHHHHHHHHH HHHHHHHHHHHHH HHHHHHHH HHHHH
turn T T T TT T T TTT T T TT T TT TT T T
insd i i i i i i ii ii i i i ii i i
i i i i i i ii ii i i i ii i i
outs o ooo ooo o oo ooo oo o oo o o o ooo
o ooo ooo o oo ooo oo o oo o o o ooo
---------------------------------------------------------------------
insd i i i i i i ii ii i i i ii i i
exp. Str i ii ii ii ii i ii i ii i ii ii i i iii ii i
outs o ooo ooo o oo ooo oo o oo o o o ooo
exp. Str ooo oo o oo o oooo oo oo oo o o ooo ooo oo o o
SSP HHHHHHHHHHHHHHHHHH HHHHHHHHHHHHH HHHHHHHH HHHHH
exp. Str HHHHHHHHHHHHH HHH HHHHHHHHHHHHHHHHHHH
sequence GFPIPDPYCWDISFRTFYTIVDDEHKTLFNGILL.LSQADNADHLNELRRCTGKHFLNEQ