1. Introduction. In the previous Newsletter (Clark, 1992), I discussed the problem of combining, in an optimal fashion, two ordered sequences of data into a single combined sequence, subject to various order constraints.
For example, the sequences could correspond to pollen data from two cores A and B, with pollen counts made at horizons A1, A2, ..., Am in Sequence A and at horizons B1, B2, ..., Bn in Sequence B. A measure of dissimilarity (or "distance") is computed between the pollen counts at any two horizons from either core. Then the total dissimilarity (or "combined path length" CPL) for any proposed combined sequence of A's and B's is obtained by adding the distances between adjacent or consecutive horizons in the combined sequence. The problem then is to find the best combined sequence, i.e. the one with minimum total dissimilarity or minimum CPL.
This optimal combined sequence (and corresponding minimum CPL, denoted here by C*) can be readily found using algorithms based on dynamic programming techniques. Since there are only a finite number of possible combined sequences, all these sequence-slotting algorithms will produce an answer, no matter what data they are applied to. This means that the results of any algorithm need to be assessed carefully. Questions which may be asked are:
2. Geometric Interpretation. To understand the H-matrix, we first note that any combined sequence or slotting may be represented geometrically as a path across a certain (m + 1) * (n + 1) grid. This grid has (m + 1) rows, labelled 0,1,2,... up to m, numbered from the top down, corresponding to the horizons in Sequence A. Similarly, it has (n + 1) columns, labelled 0,1,2,... up to n, numbered from left to right, corresponding to the horizons in sequence B. Each possible slotting can be represented by a path across the grid, starting at the top left (or north-west) corner, with co-ordinates (0,0), and ending at the bottom right (or south-east) corner, with co-ordinates (m,n). The path is constructed step-by-step, according to the following rules.
The combined sequence may start with a block of consecutive A's, or a block of B's. This is why the grid contains an extra row, labelled 0, and an extra column, labelled 0.
With this method of construction, any slotting may be represented by a path, and any path corresponds to a possible slotting (provided the path obeys rules (1) and (2) above). For example, the combined sequence
A1 A2 B1 A3 B2 A4 A5 B3
is represented geometrically by the following path.
The (m + 1) * (n + 1) numbers H(j,k) may be stored in a table or a matrix, known as the H-matrix. In principle, this matrix could be printed out in the same row by column format as the grid described in Section 2. The H-matrix has the following properties.
This coded form of the H-matrix may be visualised as a crude contour map of a canyon. The line of asterisks, representing the optimal slotting(s), can be thought of as a river, running from the north-west to south-east corner. The letters of the alphabet represent the walls of the canyon; the further down the alphabet, the higher the wall. The steeper the walls of the canyon, the more tightly is the optimum slotting constrained, and the more reliable is that part of the slotting. Conversely, in regions where the bottom of the canyon is fairly flat, as indicated by an expanse of A's in the coded H-matrix, the optimum slotting is not strongly determined. This is because a change in the slotting in that region would make only a small increase in the CPL. To put it another way, those paths marked with an A are only slightly worse than the optimal path, while those marked with an R are very much worse.
If there is a unique optimal slotting, it will be represented by the "river" of asterisks which must be only one asterisk wide. Conversely, if at any stage the river widens out into a "lake", then there are multiple paths or slottings with the same minimum CPL. These alternative slottings can be reconstructed by taking alternative paths across each lake.
4. An Example. These ideas are well illustrated by the following example involving artificial data from two sequences each with 40 horizons. For simplicity, it is assumed that only one pollen variable is measured at each horizon. The data are plotted in Fig. 1.

Fig. 2 shows the corresponding coded H-matrix as produced by PCSLOT, which confirms our predictions. The large "flat" area at the south-east corner, indicated by the array of consecutive A's and B's, confirms that this lower part of the slotting is not very reliable. Conversely, there are regions (such as j = 29, k = 28), where the canyon rises steeply from the river, indicated by C's or D's next to asterisks. This means that the corresponding portion of the optimal slotting is well-defined, in the sense that only a minor change in the path (i.e. combined sequence) will produce a substantial increase in the CPL.
STANDARDISED H-MATRIX
LOG. SCALE
... 0.9000...*... 1.0001...A... 1.0347...B 1.8474...+
TEST SEQUENCE B
COLUMNS 0 TO 40 INCLUSIVE
0 0 0 0 0
0 1 2 3 4
+ - + - + - + - +
0 *AGGIKKKLOPPPPPPPPPPPPPPPPPPPRRRRRRRRRRRR
T *AGGIKKKLOPPPPPPPPPPPPPPPPPPPRRRRRRRRRRRR
E **GGIJJJKOPPPPPPPPPPPPPPPPPPPRRRRRRRRRRRR
S A*EFHIIIJNOOOOOOOOOOOOOOOOOOPRRRRRRRRRRRR
T B*BBDEFFGKLLLLLLLLLLLLLLLNNOPRRRRRRRRRRRR
-F***BCDDEIJJJJJJJJJJKKKLLNNOPRRRRRRRRRRRR
S HBB**ABBCGIIIIIIIIIIKKKLLNNOPRRRRRRRRRRRR
E JDDB*AAACGHIIIIIIIIIKKKLLNNOPRRRRRRRRRRRR
Q JDDB*****BDDDDDDDEGHKKKLLNNOPRRRRRRRRRRRR
U MHHFEEEC**AABCCCDEGHKKKLLNNOPRRRRRRRRRRRR
E 10+MHHHHGGFB*AABCCCDEGHKKKLLNNOPRRRRRRRRRRRR
N MHHHHGGFB**ABCCCDEGHKKKLLNNOPRRRRRRRRRRRR
C MHHHHHHGBA*ABCCCDEGHKKKLLNNOPRRRRRRRRRRRR
E MHHHHHHGBA**BCCCDEGHKKKLLNNOPRRRRRRRRRRRR
MHHHHHHGCA**BCCCDEGHKKKLLNNOPRRRRRRRRRRRR
A -MHHHHHHGCA**BCCCDEGHKKKLLNNOPRRRRRRRRRRRR
MHHHHHHGCA**BCCCDEGHKKKLLNNOPRRRRRRRRRRRR
MHHHHHHGCA**BCCCDDGGJKKKKMNOORRRRRRRRRRRR
MHHHHHHGCAA**AAABBEFIIIJJLLNNQQQQQQQQQQQQ
MHHHHHHGCCCA****ABDEHIIIILLMNPPPPPPPPPPPP
20+MHHHHHHGDDDBAAA***CDGGHHHKKLMOOOOOOOOOOOO
MHHHHHHGEEECCBBAA*ABEFFGGIIKKNNNNNNNNNNNN
MHHHHHHGGGGEDDDCC***CDDEEGHIJMMMMMMMMMMMM
MHHHHHHHHHHFEEEDDA**CCDDDGGIILLLLLLLLLLLL
MHHHHHHHHHHFEEEDDA**CCCDDGGHILLLLLLLLLLLL
-MHHHHHHHHHHFEEEDDA**AABBBEFGHKKKKKKKKKKKK
MHHHHHHHHHHFEEEDDA*****AADDFGJJJJJJJJJJJJ
MHHHHHHHHHHFEEEDDA*****AADDFGJJJJJJJJJJJJ
PLKJHHHHHHHFEEEDDA*******AACDGGGGGGGGGGGG
PNNNNNNNNNNMLLLKKIHEDDDD*****CCCCCCCCCCCC
30+QQQQQQQQQQQPOOONNLLIHHHHEECB*BBBBBBBBBBBB
RRRRRRRRRRRQPPPOOMMJJIIIFFDC*BBBBBBBBBBBB
RRRRRRRRRRRQPPPOOMMJJIIIFFED*AAAAAAAABBBB
RRRRRRRRRRRQPPPPONMJJJIIGGED*AAAAAAAABBBB
RRRRRRRRRRRQQPPPPNMKJJJIGGED*AAAAAAAABBBB
-RRRRRRRRRRRQQQQPPNNKKJJJHGFE******AAABBBB
RRRRRRRRRRRRQQQPPNNKKKJJHGFEAAAA**AAABBBB
RRRRRRRRRRRRQQQPPNNKKKJJHGFEAAAA****AABBB
RRRRRRRRRRRRQQQPPNNKKKJJHGFEAAAAAAA*AAAAA
RRRRRRRRRRRRQQQPPNNKKKJJHGFEAAAAAAA*AAAAA
40+RRRRRRRRRRRRQQQPPNNKKKJJHGFEAAAAAAA******
Figure 2
Multiple solutions are the rule rather than the exception when only one variable is measured, as here. The H-matrix contains two lakes, a narrow one ranging from horizons 13 to 18 in A, and a larger triangular one bounded by horizon 28 in A. This particular boundary is no coincidence, as the Y-variable for this horizon was deliberately mis-recorded as 0.876, rather than the "correct" 0.476. There are 5 different paths across or along the upper lake, and at least 15 alternative paths across the lower lake. Each combination of paths represents a slotting with the minimum CPL, so there are at least 75 alternative optimal slottings! The PCSLOT program gives just one of these, shown graphically in Fig. 3. When there are multiple solutions, PCSLOT generally chooses one corresponding to a path along the shore of each lake.

In fact, there will be a band of points on the grid corresponding to similar invalid or prohibited slottings, for example (5,14), (5,15), ... (7,11), (8,11), ..., (3,12), (4,12) to name just a few. There will be a corresponding band of + signs in the coded H-matrix. In this case, this diagram may be visualised as a contour map of the Grand Canyon, with the +'s indicating the surrounding plateau.
Each of the six different types of additional constraints available in PCSLOT will produce a similar band of + signs, indicating invalid slottings with infinite H-value. In an extreme case, the imposed constraints could be mutually contradictory. In this case, the H-matrix would consist entirely of + signs, but PCSLOT stops beforehand, and prints a warning message to indicate there is no slotting which meets all the imposed constraints, let alone an optimal slotting.
5. Sensitivity Analysis. The fourth question posed in the Introduction can be answered by a sensitivity analysis which involves leaving out one horizon at a time, and seeing what happens. For each horizon Aj in Sequence A in turn, PCSLOT temporarily deletes that horizon, and finds the optimal slotting between the remainder of Sequence A and the full sequence B. This is then compared with the optimal slotting based on the full data, by means of three summary statistics, which it calls CSTAT, MSTAT and NSTAT.
CSTAT is the reduction in optimal CPL when horizon Aj is omitted from Sequence A, while MSTAT and NSTAT give different measures of how much the A's and B's are shuffled about when Aj is omitted. NSTAT counts up the number of times an A is swapped for a B, and vice versa, while MSTAT gives the total distance the A's have moved.
Consider the following simple example in which we consider the effect of omitting A5. The first line gives the optimal combined slotting based on the full data (m = 6, n = 4) but for the purpose of the subsequent comparison, A5 is not shown. The second line shows the optimal combined sequence for the reduced sequence {A1,A2,A3,A4,A6} against sequence B.
A1 A2 B1 B2 A3 B3 A4 B4 A6
A1 A2 A3 B1 B2 B3 B4 A4 A6
There are a total of 4 positions in the combined sequence, indicated by bold print, where there is an A in one sequence and a B in the other, so NSTAT = 4. Horizons A1 and A2 are in the same position in both combined sequences, but A3 has been moved by 2 positions (to the left), and A4 has been moved 1 position (to the right). In this case, MSTAT = 2 + 1 = 3.
Large values of CSTAT indicate that the corresponding horizon Aj has a significant effect on the CPL. In such a case, MSTAT and NSTAT need not necessarily be large as well. Indeed, it is possible to have a large value CSTAT, but zero values of both MSTAT and NSTAT, indicating that the omission of Aj has no effect on the actual slotting.
Fig. 4 shows part of the Sensitivity Analysis produced by PCSLOT for the test data used in Section 3. Only the columns headed J, CSTAT, MSTAT and NSTAT should be looked at. This output confirms that the erroneous Y-value for A28 has a big influence on the CPL, as judged by CSTAT = 0.5567. All the remaining CSTAT values (except two) are identically zero, as might be expected from the smooth curve in Fig. 3. In situations where there is just one Y-variable (as here), the MSTAT and NSTAT figures should be ignored, because they vary according to which of the alternative multiple solutions were selected.
In rare cases, CSTAT, the reduction in CPL, can actually be negative, implying that the path length is longer when there is one less point to go through! This apparent anomaly can occur when additional order constraints are imposed.
PCSLOT does the Sensitivity Analysis for Sequence A only. To do it for Sequence B, interchange the sequences, so that what was A becomes B and vice versa. Any additional order constraints must be changed accordingly. For example, if horizon A50 in the original Sequence A must precede horizon B58 in the original Sequence B, then with the new inter-changed A's and B's, B50 must now precede A58.
SENSITIVITY ANALYSIS TEST SEQUENCE A TOLERANCE FOR MULTIPLE SOLUTIONS = 0.10E-03 J CSTAT MSTAT NSTAT JD: (Type, Opt.-K).... 1 0.0226 7 6 1: 2 0; 2 0.0000 7 6 1: 1 0; 4 1; 3 0.0000 7 6 1: 1 0; 4 1; .. ...... ... ... ................... .. ...... ... ... ................... 25 0.0000 3 4 1: 2 18; 26 0.0000 3 4 1: 2 18; 27 0.1592 2 2 2: 1 18; 2 18; 4 19; 28 0.5567 10 4 1: 1 22; 29 0.0000 2 2 2: 1 18; 1 19; 1 20; 1 21; .. ...... .. ... ..................... .. ...... .. ... ..................... 38 0.0000 0 0 2: 1 32; 1 33; 4 35; 39 0.0000 0 0 1: 2 35; 40 0.0000 0 0 1: 1 35;Figure 4
References.
Clark, R.M. 1992. Sequence comparisons and sequence-slotting. INQUA - Commission for the Study of the Holocene, Working Group on Data-Handling Methods Newsletter 8:3-6.
Gordon, A.D., Thompson, R. and Clark, R.M. 1988. The use of constraints in sequence-slotting. Data Analysis and Informatics, V (ed E. Diday), North Holland. pp. 353-364.