Method Description
This section summarizes the fundamental principles needed to understand more complex applications.
As indicated earlier, each of the 20 amino acids is assigned to subgroups based on its specific chemical and/or physical character. The Appendix lists the properties included in the library that are used to distinguish the side chains from one another, with a brief description of each; the exact grouping of the amino acids according to each property is also tabulated in the Appendix. The user may select which properties to use to discriminate the secondary structure tendencies of the sequence, or choose one of the macros for default combinations. MASIA checks the amino acid at each position in the multiple aligned sequence matrix for its subgroup according to the selected properties. The user may select one of three methods implemented in the current version of MASIA (dominant criterion, art=d; probability entropy criterion, art=e; or statistical expectation criterion, art=s) to determine a consensus subgroup for each column in the sequence alignment according to the property.
The dominant criterion simply counts the number of times an amino acid in a column falls into each subgroup of a property. If the fraction for any subgroup is above a certain cut-off percentage, the column is summarized as belonging to that subgroup. If no subgroup exceeds this cut-off, the column is left undesignated for that property. Alternatively, the actual fraction of each subgroup for the property can be listed. Note that with the dominant criterion, a cut-off close to 1 makes the selection for a conserved property very strict; the lower the cut-off, the lower the selectivity for conservation.
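The dominant criterion described above can be sketched in a few lines of Python. This is only an illustration, not MASIA's implementation: the two-subgroup property `EXAMPLE_PROPERTY`, the function names, the default cut-off, and the use of `>=` for the comparison are all assumptions; the real groupings are those tabulated in the Appendix.

```python
from collections import Counter

# Hypothetical two-subgroup property, for illustration only.
# The actual groupings are the ones tabulated in the MASIA Appendix.
EXAMPLE_PROPERTY = {
    "hydrophobic": set("AVLIMFWC"),
    "polar": set("STNQYGPHDEKR"),
}


def subgroup_of(residue, prop):
    """Return the name of the subgroup containing residue, or None (e.g. a gap)."""
    for name, members in prop.items():
        if residue in members:
            return name
    return None


def subgroup_fractions(column, prop):
    """Fraction of the sequences in one alignment column that fall into each subgroup."""
    if not column:
        raise ValueError("column must contain at least one residue")
    counts = Counter(subgroup_of(r, prop) for r in column)
    # Assumption: gaps and unknown residues count toward the denominator.
    return {name: counts.get(name, 0) / len(column) for name in prop}


def dominant_subgroup(column, prop, cutoff=0.8):
    """Return the subgroup whose fraction reaches the cut-off, else None."""
    fractions = subgroup_fractions(column, prop)
    best = max(fractions, key=fractions.get)
    # Assumption: a fraction equal to the cut-off counts as reaching it.
    return best if fractions[best] >= cutoff else None
```

For the column `"AVLLS"`, four of five residues are in the hypothetical hydrophobic subgroup, so the column is designated hydrophobic at a cut-off of 0.75 and left undesignated at 0.9. `subgroup_fractions` corresponds to the alternative of listing the actual fractions.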
The statistical expectation criterion compares the prevalence of amino acids falling into each subgroup with their average relative substitution rates, based on the exchange frequency table in the Appendix. The theoretical expectation value for finding an amino acid A at a column in the alignment can be calculated from the sequence similarities qk (in percent of amino acids identical to the test sequence) of the homologous sequences k = 1, 2, 3, ..., n relative to the first sequence, as follows:
[Equation image missing] (1)
[Equation image missing] (2)
Eq. (1) applies if the amino acid is the same as that in the first sequence; it means that the theoretical expectation for finding the same amino acid at any position in the sequence equals the overall degree of identity of the sequences. If a different amino acid B is found in the sequence, Eq. (2) states that the expectation that B will substitute for A is a function of both the probability mBA of substituting amino acid B with amino acid A, as read from the amino acid exchange library exch.lib, and the overall homology of the two sequences. The expectation value for a subgroup of amino acids is the sum of the expectation values of the individual amino acids. Note, however, that the asymmetry of the amino acid exchange table (e.g., the frequency with which Gly is replaced by Ala is twice that of the reverse substitution) means that the secondary structure prediction may differ considerably depending on which sequence from the alignment is taken as the test sequence. In tests, though, rotating the sequence order did not have a major effect on the outcome of the prediction.
The probability entropy criterion is most useful when a property has many subgroups, say m. The entropy is then:

$$S = -\sum_{k=1}^{m} p_k \ln p_k \qquad (3)$$

where pk is the relative frequency of subgroup k, i.e., the number of occurrences of that subgroup in the column divided by the number of sequences. The maximal entropy is equal to ln m; for example, for 3 subgroups, Smax = 1.099. A subgroup can be defined for a column if the determined entropy is smaller than a predefined fraction of Smax; in that case the criterion is fulfilled. Note that the smaller the cut-off value is set for the entropy criterion, the higher the number of consistent sequences must be for the column to be considered conserved. This is the opposite of what one sees with the dominant criterion.
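As a minimal sketch of the entropy criterion, the following Python computes the entropy of the subgroup labels in one column and compares it with a fraction of ln m. The function names, the `fraction` parameter, and its default value are assumptions made for illustration, not MASIA's own interface.

```python
import math
from collections import Counter


def subgroup_entropy(labels):
    """Entropy (natural log) of the subgroup labels found in one column."""
    if not labels:
        raise ValueError("labels must not be empty")
    n = len(labels)
    counts = Counter(labels)
    # Subgroups that do not occur contribute nothing, so only observed ones are summed.
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def entropy_conserved(labels, m, fraction=0.5):
    """True if the column entropy is below fraction * ln(m).

    m is the number of subgroups of the property; fraction is the
    user-chosen cut-off (the 0.5 default is an arbitrary illustration).
    """
    if m < 2:
        raise ValueError("a property needs at least two subgroups")
    s_max = math.log(m)
    return subgroup_entropy(labels) < fraction * s_max
```

With three subgroups, a column split evenly among them has entropy ln 3 ≈ 1.099 and fails the test for any fraction up to 1, while a column in which nine of ten sequences share one subgroup has entropy of about 0.325 and passes at a fraction of 0.5.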
Rules combine properties in order to define secondary structure elements in the sequence. For example, when using the statistical expectation criterion, if the helical subgroup occurs more frequently in a column than the theoretical expectation value indicates, this subgroup is statistically favored at this position. Moving along the consensus sequence row, if 3 out of 5 contiguous positions in the sequence have conserved helical elements, MASIA will describe this area as a helix.
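The "3 out of 5" rule can be illustrated with a sliding window over a row of per-column flags. The sketch below is an assumption about how such a rule might be applied, in particular that every position of a qualifying window is marked; the function name and parameters are hypothetical.

```python
def mark_elements(conserved, x=3, y=5):
    """Mark positions covered by a window of y contiguous columns
    in which at least x columns are flagged as conserved.

    conserved is a list of booleans, one per alignment column, e.g.
    True where the helical subgroup is favored in that column.
    """
    if not 0 < x <= y:
        raise ValueError("need 0 < x <= y")
    n = len(conserved)
    marked = [False] * n
    for start in range(n - y + 1):
        window = conserved[start:start + y]
        if sum(window) >= x:
            # Assumption: the whole qualifying window is assigned to the element.
            for i in range(start, start + y):
                marked[i] = True
    return marked


flags = [False, True, True, False, True, False, False, False, False, False]
helix = mark_elements(flags)  # default rule: 3 of 5
```

Here the windows starting at positions 0 and 1 each contain three conserved columns, so positions 0 through 5 are marked as helix and the remaining four are not.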
MASIA searches aligned sequences in both vertical and horizontal steps. Unless explicit definitions are provided, default parameters hstep=1 and vstep=1 are assigned automatically. The step command allows specification of how to look for conservation with respect to the desired property. The check goes from a property to the next one according to the definitions of horizontal and vertical steps, which go along sequence and property directions, respectively. The step command always follows a set of property commands, and the number of step commands in a characteristic may not exceed the number of properties under the group.
The group checks the properties based on the specified steps. If no step is provided, the "x of y" criterion is used to obtain the result from a given rule. A typical command block therefore looks like this:
group
property 1 ... ...
property 2 ... ...
step
step
The characteristic determines the weighted average of the results obtained by using different rules. A typical command block therefore looks like this:
characteristic ... ...
group ... ...
property 1 ... ...
property 2 ... ...
property 3 ... ...
step
step
group ... ...
property 1 ... ...
property 2 ... ...
step
The method used in MASIA is illustrated in the following figure.
