INQUA Working Group on Data-Handling Methods

Newsletter 8: July 1992

SEQUENCE COMPARISONS AND SEQUENCE-SLOTTING

Malcolm Clark
Dept. of Mathematics
Monash University
Clayton, Victoria
Australia, 3168
Email: rmc@monu1.cc.monash.edu.au

1. Introduction. Many investigations involve the comparison, cross-correlation or amalgamation of two or more sequences of measurements. Examples of such sequences are geophysical well-logs, palaeomagnetic measurements from lake sediment cores, and pollen sequences. An essential feature is that observations within each sequence are ordered, say by increasing depth down a core, or by increasing age.

As an example, consider the comparison of pollen data obtained from two cores, say core A and core B. Each core is sub-divided into a number of small samples or horizons, numbered A1, A2,..,Am in core A and B1, B2,..,Bn in core B. The numbering is in the same direction, say from top to bottom, in both cases. At each of the (m + n) horizons, a count is made of the various types of pollen found. If the cores are from the same lake, then the pattern of variability in the pollen counts should be similar in both cores. If this is the case, this common pattern can best be estimated by combining the sequences into a single sequence.

In general, given two ordered sequences of observations, the objectives may be: (1) to measure the difference between the sequences, or (2) to identify matching parts of the sequence, or (3) to combine the two sequences into a single joint sequence.

A general strategy for achieving these objectives is first to define a measure of dissimilarity or distance d between any two objects (i.e. horizons, in the case of pollen analysis) from either core. The objects A1, A2,.., Am and B1, B2,.., Bn may thus be represented as points in some space of k dimensions. One intuitively appealing measure of the difference or distance between two sequences is the total distance or "combined path length" (CPL) through the combined sequence of A's and B's. For any such sequence, the CPL is obtained by adding the distances between adjacent objects (i.e. horizons) in the combined sequence. The problem then is to combine or to slot the two sequences into a single combined sequence, preserving the ordering within the A's and B's, in such a way that the CPL is minimised. The resulting minimum CPL, or some transformation of it, is a measure of the difference between the two sequences, while the corresponding optimum combined sequence gives the best amalgamation of the information in the original sequences.

For example, one possible combined sequence for the case m = 4 and n = 3 would be

A1 A2 B1 A3 B2 A4 B3,

for which the

CPL = d(A1, A2) + d(A2, B1) + d(B1, A3) + d(A3, B2) + d(B2, A4) + d(A4, B3),

where d( , ) denotes the distance between the specified horizons. Notice how the A's and B's are in their correct order.

An alternative approach, applicable when each object Ai or Bj can take only a few possible values, is to align the two sequences side-by-side, rather than slotting them together into a single sequence. If the numbers of A's and B's are unequal, there will necessarily be some gaps in either sequence. For example, one possible alignment with m = 4 and n = 3 would be

A1 A2 . A3 A4 . B1 B2 B3 .

where the dots indicate gaps. This approach arises naturally when comparing DNA sequences and the like, but the sequence-slotting technique seems more appropriate when the variables of interest are on a continuous scale.

2. Methodology. The optimum combined sequence (in the sense of minimum CPL) may be easily found by dynamic programming techniques. The basic algorithm for this and similar problems was discovered independently by various workers in the early 1970's, e.g. Delcoigne and Hansen (1977), Sakoe and Chiba (1978), Needleman and Wunsch (1970). The sequence-slotting algorithm uses an array of 2mn numbers, which is built up iteratively using essentially two equations. (See Clark (1985), Gordon (1980) for details, and Sankoff and Kruskal (1983) for a broad overview).

The numerous dynamic programming algorithms may be divided into two categories, depending on whether the sequences are to be aligned or slotted. There is an enormous literature on techniques for the alignment of sequences (such as DNA sequences), but very little on sequence-slotting as described above.

3. Additional Constraints. Sometimes the combined sequence must take account of additional constraints based on stratigraphic information. For example, each core may contain one or more "markers" which must match up in any combined sequence. There could be say a distinct band of sediment at position A5 in core A, but at position B11 in core B. Then in any combined sequence A5 and B11 must be immediately adjacent.

The algorithms of Gordon (1980) and Clark (1985) allow for a variety of such additional stratigraphic constraints. For example, users may specify that certain parts of one sequence must overlap a particular subset of the other sequence, or that part of one sequence must precede part of the other, or that part of the final sequence must contain elements from just one sequence only.

On the other hand, sometimes the optimal combined sequence may contain long blocks of consecutive A's or B's. This is likely to happen in parts of the sequence where the response variables (e.g. pollen distribution) are nearly constant. In such cases, these long blocks do not necessarily represent a significant difference between the relevant parts of the two sequences. The CONSLOT and PCSLOT programs mentioned below enable the user to specify the maximum number of consecutive A's or B's (i.e. the maximum block-length) to be permitted.

4. Problems. Experience has shown that the sequence-slotting procedure depends heavily on the choice of distance measure, and on any preliminary scaling of the measurements. It is advisable to try alternative distance measures, e.g. "city-block" metric versus Euclidean distance, and to scale the original data to ensure that all variables are in comparable units of measurement.

The solution to the sequence-slotting problem need not necessarily be unique. There may be many alternative combined sequences all with the same minimum CPL especially when only one variable or characteristic is measured on each object Ai or Bj. The various algorithms guarantee to find only one of the multiple solutions.

In principle, the dynamic programming algorithms can be extended to solve the problem of the simultaneous slotting of k sequences into a single combined sequence. In practice, the amount of computer time and memory required increases so rapidly that only the above case of k = 2 is feasible.

Thompson and Clark (1989) give an overview of the practical problems associated with stratigraphic correlation using sequence-slotting.

5. Assessment of results. All sequence-slotting algorithms will produce an answer, no matter what data they are applied to. But does the answer make sense? Questions to ask are:

Questions (1) and (2) are answered to some extent by the so-called "H-matrix" devised by Gordon et al. (1988). This displays graphically both the optimal slotting(s) and some of the slottings which are worse than it. It also indicates which parts of the slotting are "tight", where a minor change in the combined sequence would produce a large change in the CPL.

Question (3) may be answered by the simple device of leaving one observation out at a time, seeing what happens, and repeating this all the way down either sequence. In principle, this sensitivity analysis requires (m + n) passes through the data, but Gordon et al. (1988) show how it can be done in a single pass, at the same time as computing the H-matrix.

There are two proposed statistical tests of the hypothesis implied in (4) that both sequences show the same basic pattern in the variables of interest but on possibly different time-scales. Both methods involve computer simulation. Gordon (1982) suggested simulating data, with interpolation, from one sequence, and slotting the simulated data against the other sequence. Repeated simulations give an indication of the mean and standard deviation of the CPL. Clark (1989) proposed a randomisation test in which each sequence is split at random into two sub-sequences, A1 and A2 from sequence A, and B1 and B2 from sequence B. The idea is that each random sub-sequence is slotted against every other sub-sequence, and the whole procedure is repeated say 40 times. If the optimum CPLs for the "between-sequence" comparisons (e.g. A1 versus B1) are consistently bigger than those for the "within-sequence" comparisons (e.g. A1 versus A2), then this suggests that the original sequences are not compatible. This test appears to be very sensitive, but is difficult to implement when there are additional stratigraphic constraints.

6. Computer programs. There are numerous computer programs dealing with some form of sequence comparison. I know of only a few which deal with the particular problem of sequence slotting as discussed here. These are based on Allan Gordon's SLOTSEQ program published in 1980.

SLOTSEE, produced by Lou Maher, is a PC-based program which is easy-to-use, produces very effective graphical output, and is designed particularly for the analysis of pollen sequences.

I have written CONSLOT (now in Version 7.3) which allows for 12 different types of stratigraphic constraints as well as block-length constraints, and produces both the H-matrix and sensitivity analysis of Gordon et al. (1988). Written in Fortran to run on a mainframe computer, it can readily handle sequences containing up to 500 objects each. The output is largely non-graphical. I now have a PC-version, known as PCSLOT, which retains all the features, is easier to use, but at present is limited to about 100 objects per sequence.

References

Clark, R.M. 1985. A Fortran program for constrained sequence-slotting based on minimum combined path length. Computers & Geosciences 11, #5, 605-617.

Clark, R.M. 1989. A randomization test for the comparison of ordered sequences. Math. Geol. 21, #4, 429-442.

Delcoigne, A. and Hansen, P. 1975. Sequence comparison by dynamic programming. Biometrika 62, 661-664.

Gordon, A.D. 1980. SLOTSEQ: A FORTRAN IV program for comparing two sequences of observations. Computers & Geosciences 6, 7-20.

Gordon, A.D. 1982. An investigation of two sequence-comparison statistics. Austral. J. Statist. 24, 332-342.

Gordon, A.D., Thompson, R. and Clark, R.M. 1988. The use of constraints in sequence-slotting. Data Analysis and Informatics, V (ed E. Diday), North Holland. pp.353-364.

Needleman, S.B. and Wunsch, C.D. 1970. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol., 48, 443-453.

Sakoe, H. and Chiba, S. 1978. Dynamic programming algorithm optimization for spoken word recognition. IEEE Transactions for Acoustics, Speech and Signal Processing ASSP-26, 43-49.

Sankoff, D. and Kruskal, J.B. 1983. Time Warps, String Edits and Macromolecules: The Theory and Practice of Sequence Comparison. Addison-Wesley.

Thompson, R. and Clark, R.M. 1989. Sequence slotting for stratigraphic correlation between cores: theory and practice. J. Paleolimnology, 2, 173-184.


Copyright © 1992 Malcolm Clark
Home page
Newsletter 8 index
Author index
Subject index
WWW pages by K.D. Bennett