INQUA Working Group on Data-Handling Methods

Newsletter 9: January 1993

ASSESSMENT OF SEQUENCE-SLOTTING

Malcolm Clark
Department of Mathematics
Monash University
Clayton, Victoria
Australia, 3168
E-mail: rmc@monu1.cc.monash.edu.au

1. Introduction. In the previous Newsletter (Clark, 1992), I discussed the problem of combining, in an optimal fashion, two ordered sequences of data into a single combined sequence, subject to various order constraints.

For example, the sequences could correspond to pollen data from two cores A and B, with pollen counts made at horizons A1, A2, ..., Am in Sequence A and at horizons B1, B2, ..., Bn in Sequence B. A measure of dissimilarity (or "distance") is computed between the pollen counts at any two horizons from either core. Then the total dissimilarity (or "combined path length" CPL) for any proposed combined sequence of A's and B's is obtained by adding the distances between adjacent or consecutive horizons in the combined sequence. The problem then is to find the best combined sequence, i.e. the one with minimum total dissimilarity or minimum CPL.

This optimal combined sequence (and corresponding minimum CPL, denoted here by C*) can be readily found using algorithms based on dynamic programming techniques. Since there are only a finite number of possible combined sequences, all these sequence-slotting algorithms will produce an answer, no matter what data they are applied to. This means that the results of any algorithm need to be assessed carefully. Questions which may be asked are:

  1. Is there a unique optimal slotting?
  2. Which parts of the combined slotting are most reliable, and which are least reliable?
  3. Is the optimal slotting greatly superior to other possible ones?
  4. Do some individual horizons have a big influence on the outcome?
The first three questions are answered to some extent by the so-called "H-matrix", devised by Gordon et al. (1988), which gives a graphical representation of the optimal slotting(s) and some alternative slottings.

2. Geometric Interpretation. To understand the H-matrix, we first note that any combined sequence or slotting may be represented geometrically as a path across a certain (m + 1) * (n + 1) grid. This grid has (m + 1) rows, labelled 0,1,2,... up to m, numbered from the top down, corresponding to the horizons in Sequence A. Similarly, it has (n + 1) columns, labelled 0,1,2,... up to n, numbered from left to right, corresponding to the horizons in sequence B. Each possible slotting can be represented by a path across the grid, starting at the top left (or north-west) corner, with co-ordinates (0,0), and ending at the bottom right (or south-east) corner, with co-ordinates (m,n). The path is constructed step-by-step, according to the following rules.

  1. At any stage, the next segment of the path is either one step down or one step right from the current point.
  2. If the next horizon in the combined sequence is an A, the path goes down one step; if the next horizon is a B, it goes right one step.
If the path goes through the point on the grid with co-ordinates (j,k) (the intersection of j-th row and k-th column), the first (j + k) elements in the corresponding combined sequence comprise A1, A2, ... up to Aj and B1, B2, ... up to Bk. Furthermore, Aj and Bk must be immediately adjacent.

The combined sequence may start with a block of consecutive A's, or a block of B's. This is why the grid contains an extra row, labelled 0, and an extra column, labelled 0.

With this method of construction, any slotting may be represented by a path, and any path corresponds to a possible slotting (provided the path obeys rules (1) and (2) above). For example, the combined sequence

A1 A2 B1 A3 B2 A4 A5 B3

is represented geometrically by the following path.


Text figure
3. The H-matrix. The H-matrix essentially assigns a numerical value to each point on the above (m + 1) * (n + 1) grid. For each combination (j,k), we define H*(j,k) as the minimum CPL subject to the additional constraint that the path passes through the point (j,k) on the grid. In other words, in the corresponding combined sequence, horizons Aj and Bk must be immediately adjacent. Then H(j,k) is simply a scaled-down version of H*(j,k), obtained by dividing the latter by C*, the minimum CPL for the original problem. Both H*(j,k) and C* take account of all other constraints, such as block-length constraints.

The (m + 1) * (n + 1) numbers H(j,k) may be stored in a table or a matrix, known as the H-matrix. In principle, this matrix could be printed out in the same row by column format as the grid described in Section 2. The H-matrix has the following properties.

  1. All points on the optimal slotting(s) have H-value of 1.
  2. All other H-values are greater than 1.
  3. It is easy to compute a lower bound on the CPL for any combined sequence; simply find the maximum of the H-values along the path corresponding to that specific sequence. For example, if this maximum is 1.2 say, then the CPL for that combined sequence must be at least 1.2 times C*, the optimum CPL.
The H-matrix can be computed using a relatively straight-forward extension of the basic dynamic programming algorithm. This calculation is done automatically in my CONSLOT and PCSLOT programs (Clark, 1992). The H-matrix is displayed in a coded semi-graphical form, with all H-values of 1 printed as asterisks, and increasing H-values represented by successive letters of the alphabet. The threshold values for determining this latter coding are equally spaced on either a linear or logarithmic scale, as selected by the user.

This coded form of the H-matrix may be visualised as a crude contour map of a canyon. The line of asterisks, representing the optimal slotting(s), can be thought of as a river, running from the north-west to south-east corner. The letters of the alphabet represent the walls of the canyon; the further down the alphabet, the higher the wall. The steeper the walls of the canyon, the more tightly is the optimum slotting constrained, and the more reliable is that part of the slotting. Conversely, in regions where the bottom of the canyon is fairly flat, as indicated by an expanse of A's in the coded H-matrix, the optimum slotting is not strongly determined. This is because a change in the slotting in that region would make only a small increase in the CPL. To put it another way, those paths marked with an A are only slightly worse than the optimal path, while those marked with an R are very much worse.

If there is a unique optimal slotting, it will be represented by the "river" of asterisks which must be only one asterisk wide. Conversely, if at any stage the river widens out into a "lake", then there are multiple paths or slottings with the same minimum CPL. These alternative slottings can be reconstructed by taking alternative paths across each lake.

4. An Example. These ideas are well illustrated by the following example involving artificial data from two sequences each with 40 horizons. For simplicity, it is assumed that only one pollen variable is measured at each horizon. The data are plotted in Fig. 1.


Figure 1
Figure 1
The Y-variable is nearly constant over the last 10 horizons in both sequences. This implies that the distances between these horizons will be relatively small, and alternative slottings of these last 10 horizons will give almost the same total distance or dissimilarity. Hence this bottom part of any optimal slotting will not be very reliable. Conversely, the slotting will be much more reliable in regions where the Y-variable is changing rapidly, at about horizon 30 in A and 27 in B for example.

Fig. 2 shows the corresponding coded H-matrix as produced by PCSLOT, which confirms our predictions. The large "flat" area at the south-east corner, indicated by the array of consecutive A's and B's, confirms that this lower part of the slotting is not very reliable. Conversely, there are regions (such as j = 29, k = 28), where the canyon rises steeply from the river, indicated by C's or D's next to asterisks. This means that the corresponding portion of the optimal slotting is well-defined, in the sense that only a minor change in the path (i.e. combined sequence) will produce a substantial increase in the CPL.


 STANDARDISED H-MATRIX

  LOG.   SCALE
... 0.9000...*... 1.0001...A... 1.0347...B      1.8474...+

       TEST SEQUENCE B 

                    COLUMNS   0 TO  40 INCLUSIVE

          0         0         0         0         0
          0         1         2         3         4
          +    -    +    -    +    -    +    -    +
        0 *AGGIKKKLOPPPPPPPPPPPPPPPPPPPRRRRRRRRRRRR
 T        *AGGIKKKLOPPPPPPPPPPPPPPPPPPPRRRRRRRRRRRR
 E        **GGIJJJKOPPPPPPPPPPPPPPPPPPPRRRRRRRRRRRR
 S        A*EFHIIIJNOOOOOOOOOOOOOOOOOOPRRRRRRRRRRRR
 T        B*BBDEFFGKLLLLLLLLLLLLLLLNNOPRRRRRRRRRRRR
         -F***BCDDEIJJJJJJJJJJKKKLLNNOPRRRRRRRRRRRR
 S        HBB**ABBCGIIIIIIIIIIKKKLLNNOPRRRRRRRRRRRR
 E        JDDB*AAACGHIIIIIIIIIKKKLLNNOPRRRRRRRRRRRR
 Q        JDDB*****BDDDDDDDEGHKKKLLNNOPRRRRRRRRRRRR
 U        MHHFEEEC**AABCCCDEGHKKKLLNNOPRRRRRRRRRRRR
 E     10+MHHHHGGFB*AABCCCDEGHKKKLLNNOPRRRRRRRRRRRR
 N        MHHHHGGFB**ABCCCDEGHKKKLLNNOPRRRRRRRRRRRR
 C        MHHHHHHGBA*ABCCCDEGHKKKLLNNOPRRRRRRRRRRRR
 E        MHHHHHHGBA**BCCCDEGHKKKLLNNOPRRRRRRRRRRRR
          MHHHHHHGCA**BCCCDEGHKKKLLNNOPRRRRRRRRRRRR
 A       -MHHHHHHGCA**BCCCDEGHKKKLLNNOPRRRRRRRRRRRR
          MHHHHHHGCA**BCCCDEGHKKKLLNNOPRRRRRRRRRRRR
          MHHHHHHGCA**BCCCDDGGJKKKKMNOORRRRRRRRRRRR
          MHHHHHHGCAA**AAABBEFIIIJJLLNNQQQQQQQQQQQQ
          MHHHHHHGCCCA****ABDEHIIIILLMNPPPPPPPPPPPP
       20+MHHHHHHGDDDBAAA***CDGGHHHKKLMOOOOOOOOOOOO
          MHHHHHHGEEECCBBAA*ABEFFGGIIKKNNNNNNNNNNNN
          MHHHHHHGGGGEDDDCC***CDDEEGHIJMMMMMMMMMMMM
          MHHHHHHHHHHFEEEDDA**CCDDDGGIILLLLLLLLLLLL
          MHHHHHHHHHHFEEEDDA**CCCDDGGHILLLLLLLLLLLL
         -MHHHHHHHHHHFEEEDDA**AABBBEFGHKKKKKKKKKKKK
          MHHHHHHHHHHFEEEDDA*****AADDFGJJJJJJJJJJJJ
          MHHHHHHHHHHFEEEDDA*****AADDFGJJJJJJJJJJJJ
          PLKJHHHHHHHFEEEDDA*******AACDGGGGGGGGGGGG
          PNNNNNNNNNNMLLLKKIHEDDDD*****CCCCCCCCCCCC
       30+QQQQQQQQQQQPOOONNLLIHHHHEECB*BBBBBBBBBBBB
          RRRRRRRRRRRQPPPOOMMJJIIIFFDC*BBBBBBBBBBBB
          RRRRRRRRRRRQPPPOOMMJJIIIFFED*AAAAAAAABBBB
          RRRRRRRRRRRQPPPPONMJJJIIGGED*AAAAAAAABBBB
          RRRRRRRRRRRQQPPPPNMKJJJIGGED*AAAAAAAABBBB
         -RRRRRRRRRRRQQQQPPNNKKJJJHGFE******AAABBBB
          RRRRRRRRRRRRQQQPPNNKKKJJHGFEAAAA**AAABBBB
          RRRRRRRRRRRRQQQPPNNKKKJJHGFEAAAA****AABBB
          RRRRRRRRRRRRQQQPPNNKKKJJHGFEAAAAAAA*AAAAA
          RRRRRRRRRRRRQQQPPNNKKKJJHGFEAAAAAAA*AAAAA
       40+RRRRRRRRRRRRQQQPPNNKKKJJHGFEAAAAAAA******

Figure 2
In this example, the threshold values for the coding of the H-matrix are equally spaced on a logarithmic scale. All those points on the matrix coded with an A have an H-value between 1.0001 and 1.0347, those coded B lie between 1.0347 and (1.0347)2, and so on. The maximum H-value is 1.8474. More details on the various options associated with the H-matrix are given in the READ.ME file available with PCSLOT.

Multiple solutions are the rule rather than the exception when only one variable is measured, as here. The H-matrix contains two lakes, a narrow one ranging from horizons 13 to 18 in A, and a larger triangular one bounded by horizon 28 in A. This particular boundary is no coincidence, as the Y-variable for this horizon was deliberately mis-recorded as 0.876, rather than the "correct" 0.476. There are 5 different paths across or along the upper lake, and at least 15 alternative paths across the lower lake. Each combination of paths represents a slotting with the minimum CPL, so there are at least 75 alternative optimal slottings! The PCSLOT program gives just one of these, shown graphically in Fig. 3. When there are multiple solutions, PCSLOT generally chooses one corresponding to a path along the shore of each lake.


Figure 3
Figure 3
5. Additional Constraints. PCSLOT has the facility for the user to specify additional constraints on the combined sequence, for example that horizons A5 and B11 must be immediately adjacent. This constraint implies, for example, that B13 must not be immediately adjacent to A5, remembering that the horizons within each sequence must be kept in their correct order. In geometric terms, any path on the grid which passes through the point (5,13) corresponds to a prohibited slotting, since it violates that condition that A5 and B11 must be adjacent. Accordingly, H(5,13), the value of the H-matrix at (5,13), is set to infinity, and in PCSLOT the corresponding point on the coded H-matrix is printed as a `+'.

In fact, there will be a band of points on the grid corresponding to similar invalid or prohibited slottings, for example (5,14), (5,15), ... (7,11), (8,11), ..., (3,12), (4,12) to name just a few. There will be a corresponding band of + signs in the coded H-matrix. In this case, this diagram may be visualised as a contour map of the Grand Canyon, with the +'s indicating the surrounding plateau.

Each of the six different types of additional constraints available in PCSLOT will produce a similar band of + signs, indicating invalid slottings with infinite H-value. In an extreme case, the imposed constraints could be mutually contradictory. In this case, the H-matrix would consist entirely of + signs, but PCSLOT stops beforehand, and prints a warning message to indicate there is no slotting which meets all the imposed constraints, let alone an optimal slotting.

5. Sensitivity Analysis. The fourth question posed in the Introduction can be answered by a sensitivity analysis which involves leaving out one horizon at a time, and seeing what happens. For each horizon Aj in Sequence A in turn, PCSLOT temporarily deletes that horizon, and finds the optimal slotting between the remainder of Sequence A and the full sequence B. This is then compared with the optimal slotting based on the full data, by means of three summary statistics, which it calls CSTAT, MSTAT and NSTAT.

CSTAT is the reduction in optimal CPL when horizon Aj is omitted from Sequence A, while MSTAT and NSTAT give different measures of how much the A's and B's are shuffled about when Aj is omitted. NSTAT counts up the number of times an A is swapped for a B, and vice versa, while MSTAT gives the total distance the A's have moved.

Consider the following simple example in which we consider the effect of omitting A5. The first line gives the optimal combined slotting based on the full data (m = 6, n = 4) but for the purpose of the subsequent comparison, A5 is not shown. The second line shows the optimal combined sequence for the reduced sequence {A1,A2,A3,A4,A6} against sequence B.

A1 A2 B1 B2 A3 B3 A4 B4 A6

A1 A2 A3 B1 B2 B3 B4 A4 A6

There are a total of 4 positions in the combined sequence, indicated by bold print, where there is an A in one sequence and a B in the other, so NSTAT = 4. Horizons A1 and A2 are in the same position in both combined sequences, but A3 has been moved by 2 positions (to the left), and A4 has been moved 1 position (to the right). In this case, MSTAT = 2 + 1 = 3.

Large values of CSTAT indicate that the corresponding horizon Aj has a significant effect on the CPL. In such a case, MSTAT and NSTAT need not necessarily be large as well. Indeed, it is possible to have a large value CSTAT, but zero values of both MSTAT and NSTAT, indicating that the omission of Aj has no effect on the actual slotting.

Fig. 4 shows part of the Sensitivity Analysis produced by PCSLOT for the test data used in Section 3. Only the columns headed J, CSTAT, MSTAT and NSTAT should be looked at. This output confirms that the erroneous Y-value for A28 has a big influence on the CPL, as judged by CSTAT = 0.5567. All the remaining CSTAT values (except two) are identically zero, as might be expected from the smooth curve in Fig. 3. In situations where there is just one Y-variable (as here), the MSTAT and NSTAT figures should be ignored, because they vary according to which of the alternative multiple solutions were selected.

In rare cases, CSTAT, the reduction in CPL, can actually be negative, implying that the path length is longer when there is one less point to go through! This apparent anomaly can occur when additional order constraints are imposed.

PCSLOT does the Sensitivity Analysis for Sequence A only. To do it for Sequence B, interchange the sequences, so that what was A becomes B and vice versa. Any additional order constraints must be changed accordingly. For example, if horizon A50 in the original Sequence A must precede horizon B58 in the original Sequence B, then with the new inter-changed A's and B's, B50 must now precede A58.



 SENSITIVITY ANALYSIS     TEST SEQUENCE A 

  TOLERANCE FOR MULTIPLE SOLUTIONS =  0.10E-03

   J     CSTAT MSTAT NSTAT   JD: (Type, Opt.-K)....
   1    0.0226     7     6   1:   2   0;
   2    0.0000     7     6   1:   1   0; 4   1;
   3    0.0000     7     6   1:   1   0; 4   1;
  ..    ......    ...   ... ...................
  ..    ......    ...   ... ...................
  25    0.0000     3     4   1:   2  18;
  26    0.0000     3     4   1:   2  18;
  27    0.1592     2     2   2:   1  18; 2  18; 4  19;
  28    0.5567    10     4   1:   1  22;
  29    0.0000     2     2   2:   1  18; 1  19; 1  20; 1 21;
  ..    ......    ..    ... .....................
  ..    ......    ..    ... .....................
  38    0.0000     0     0   2:   1  32; 1  33; 4  35;
  39    0.0000     0     0   1:   2  35;
  40    0.0000     0     0   1:   1  35;

Figure 4
In conclusion, correlating or cross-matching two sequences of data by means of sequence-slotting is a complex process, and the results of any sequence-slotting algorithm should always be assessed carefully. The H-matrix provides a simple but informative graphical method for assessing the reliability of the optimal solution, for comparing near-optimal matchings, and for showing any multiple solutions. The sensitivity analysis complements this, by finding those observations having a big influence on the final result.

References.

Clark, R.M. 1992. Sequence comparisons and sequence-slotting. INQUA - Commission for the Study of the Holocene, Working Group on Data-Handling Methods Newsletter 8:3-6.

Gordon, A.D., Thompson, R. and Clark, R.M. 1988. The use of constraints in sequence-slotting. Data Analysis and Informatics, V (ed E. Diday), North Holland. pp. 353-364.


Copyright © 1993 Malcolm Clark
Home page
Newsletter 9 index
Author index
Subject index
WWW pages by K.D. Bennett