An application of information theory to genetic mutations and the matching of polypeptide sequences
Abstract
An information-based methodology for determining the quality of an alignment of two code sequences is presented. The assumptions involved in the procedure are as follows. (i) The information required to effect the alignment is separable into three categories: location, type and operation detail. The information basis of all three categories must be the same so that the information values obtained may be added together to produce a meaningful total for the entire alignment. (ii) All possible alignments may be expressed as composites of four mutation operations, UR, S, In and D. Two mutations are constrained from occurring at the same site to avoid ambiguity and to render the set of alignments finite. (iii) The character statistics and corresponding estimates of the probabilities of occurrence for mutations are available or at least estimable.
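The additivity requirement in assumption (i) can be illustrated with a minimal sketch. It assumes, as the abstract does not state explicitly, that each category contributes -log2(p) bits for an event of probability p, so that location, type and operation detail share one basis and can be summed. Every name and probability below (`spacing_prob`, `OP_PROB`, `detail_prob`, the geometric spacing and deletion-length models, the uniform four-letter alphabet) is hypothetical and chosen only to keep the arithmetic easy to follow; none of it comes from the paper.

```python
import math


def bits(p):
    """Information, in bits, of an event with probability p (assumed basis)."""
    if not 0.0 < p <= 1.0:
        raise ValueError("probability must lie in (0, 1]")
    return -math.log2(p)


def spacing_prob(k):
    """Hypothetical geometric model for the spacing k >= 1 between mutations."""
    return 0.5 ** k


# Hypothetical frequencies of the four mutation operations named in the abstract.
OP_PROB = {"UR": 0.25, "S": 0.25, "In": 0.25, "D": 0.25}


def detail_prob(op, detail):
    """Hypothetical operation-detail model.

    D: detail is a deletion length with a geometric distribution.
    S, In: detail is a character drawn uniformly from a toy 4-letter alphabet.
    Other operations have no toy model here.
    """
    if op == "D":
        return 0.5 ** detail
    if op in ("S", "In"):
        return 0.25
    raise ValueError(f"no toy detail model for operation {op!r}")


def alignment_information(mutations):
    """Sum location, type and detail information over all mutations.

    Each mutation is a tuple (spacing, operation, detail).
    """
    total = 0.0
    for spacing, op, detail in mutations:
        total += bits(spacing_prob(spacing))     # location
        total += bits(OP_PROB[op])               # type
        total += bits(detail_prob(op, detail))   # operation detail
    return total


# Two mutations: a substitution after a spacing of 3, then a deletion of length 2.
example = [(3, "S", "A"), (1, "D", 2)]
print(alignment_information(example))  # 12.0 bits under these toy probabilities
```

Under this sketch, an alignment that needs fewer bits to specify would be judged the better one, and no separately chosen penalty factor enters the total.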
In application, one needs to obtain estimates of the distributions of (a) the spacing between mutations, (b) the frequency of the four mutation operations, and (c) the inserted character frequencies and deletion lengths. Some of the constraints on these estimates are described, and means for obtaining reasonable values in each case are suggested. These requirements are all extremely fundamental in nature and can, in principle, be satisfied biochemically. The greatest potential value of the method is that these physical quantities may be related in a non-arbitrary way to the complex problem of alignment. The method requires no arbitrary penalty factors and should help to guide geneticists in gathering the necessary data.
- Publication: Journal of Theoretical Biology
- Pub Date: 1973
- Bibcode: 1973JThBi..42..245R