INQUA - COMMISSION FOR THE STUDY OF THE HOLOCENE
Working Group on Data-handling Methods
Newsletter 1, June 1988

At the XIIth INQUA meeting in Ottawa, 1987, Dr. Brigitta Amman, President-elect of the Holocene Commission, established a working group on data handling. The aims of the group are:

(1) To assemble a mailing list of colleagues who would be interested in both receiving and contributing to a flow of useful information on developments in computer and other technology that help us to handle, exchange, analyze and otherwise deal with our data more effectively.

(2) We will attempt to tap as broad a group of sub-disciplines as possible, to include both physical and biological data categories.

(3) We plan to act by producing a simple newsletter roughly once per year that will provide a mechanism of communication among all with an interest, and it might include up-to-date bibliographies on literature on data handling, quantitative methods, etc., as well as an inventory of colleagues with particular expertise and willingness to provide information, programmes, etc.

(4) We will attempt to keep abreast of new modes of communication, so that our mailing list can include and perhaps eventually make use of such mechanisms as Bitnet, etc.

A small group of colleagues has agreed to serve but we are open to additions, either by suggestion of others or volunteering. At present the group consists of:

J.C. Ritchie
University of Toronto
Scarborough College
1265 Military Trail
Scarborough, Ontario
CANADA M1C 1A4
Bitnet: Ritchie@UToronto

John Birks
University of Bergen
Botanical Institute
N-5027 Bergen, NORWAY

Louis J. Maher, Jr.
Dept. of Geology & Geophysics
University of Wisconsin
Madison, WI 53706

Rick Battarbee
Department of Geography
University College London
London, U.K. WC1H 0AP

Owen K. Davis
Department of Geosciences
University of Arizona
Tucson, AZ 85721

Anyone who wishes to be added to the Working Group's mailing list and/or to contribute in any way, should please write or BITNET immediately to J.C. Ritchie, University of Toronto, Scarborough College, 1265 Military Trail, Scarborough, Ontario, Canada M1C 1A4. (Ritchie@UToronto)

This issue contains: a short article on general aspects of handling pollen data with microcomputers; specific programmes for data entry and plotting; short reviews of recent useful relevant publications; miscellaneous short notes; and a preliminary electronic mail directory. Future issues will contain information on databases for diatoms and for peatland stratigraphy; Quaternary vertebrate data storage; a contribution from Lou Maher on his latest programmes and their availability; and other items that you will contribute.

[*p.1 / p.2*]

POLLEN DATA STORAGE AND ANALYSIS ON MICROCOMPUTERS
K. Gajewski, Scarborough College, University of Toronto.

Microcomputers have totally changed the way that data analysis is done. A wide variety of statistical, graphics and database software is readily available. For most applications, microcomputers have enough memory and power to perform statistical analysis as well as store large amounts of paleoenvironmental data. Best of all is the easy and interactive access to your data. The cost is the large amount of time needed to learn the various applications as well as the expense of hardware and software. This note offers some thoughts about the storage of data on microcomputers and the exchange of data. A few comments about data archiving are based on experience creating a database of Canadian pollen diagrams.

Data exchange

There are two standard microcomputer systems: "IBM (DOS)" and "APPLE". Disks in one of these two formats are now the easiest way to exchange data.
Unless there are literally millions of numbers, there is little reason to use mainframe tapes (exceptions include remote sensing data and some climate data sets, for example) as these are notoriously difficult to use and to transport between different systems. Of these two microcomputer standards, IBM is the larger, and most applications (statistical, database, etc.) are run on it. However, the APPLE standard always seems to come up with useful applications.

The other way to send data, especially if the files are not too long, is through computer mail, namely BITNET (USA), NETNORTH (CANADA), EARNET (CONTINENTAL EUROPE), and JANET (UK). The first three are usually considered as one system, and there is rarely any problem communicating between them. JANET is a bit more troublesome (at least in our experience at Scarborough), but again, it works well once the initial contact is made and the correct name (account) and address (node) are entered into your NAMES file. These systems are used to send data files as well as letters, and they are a rapid and inexpensive way to communicate. See Nature 328:752-753 (27 Aug. 87) for more information about the system.

To use these systems, it is necessary to have an account with the local mainframe computer that supports the system (most universities in North America do). Details vary depending on the local system, but it involves having an account name (your own) and node (the University, e.g. RITCHIE@UTORONTO) which you send to your colleagues. Incoming mail appears in your MAILLIST, where it can be read or copied into your FILELIST. Incoming files (data, as opposed to letters) appear in your READERLIST, from which they can be RECEIVED into your FILELIST, and downloaded to your microcomputer.

Data storage

There are several possibilities for the storage of pollen (or other paleoenvironmental) data. Although there is a lot of pollen data around, efficient data storage is not the primary consideration.
More important are ease of access, error-free files, and good documentation. The type of storage system also depends on the use the data will be put to and the resources (money and [*p.2 / p.3*] programmers) available.

Commercial programs for the storage and analysis of data are now available, and although expensive, provide a means to store data of various formats. Two useful types of commercial programs are database systems and spreadsheets, and the two most common of these are dBASE and LOTUS.

dBASE is a relational database system designed to store and organize data. Each record of the file (sample) includes information such as location, date (YBP), sediment type, ...and all of the identified fossils. The particular data needed for an application can be extracted and used by your program. Strengths include a powerful system complete with a programming language, a large number of books and seminars that explain the system and a standard file type (DBF). It is relatively hard to learn, and the program has no graphics or analysis capabilities.

LOTUS is much easier to learn, and there are also many books and seminars explaining the program. A spreadsheet basically stores the data as a matrix of "cells", and it is easy to enter and modify data, rearrange columns and rows, compute percentages and influx, produce chronology files, and best of all, see the data. There are some rudimentary statistics and graphics functions for preliminary data checking. In addition to the actual data, labels and documentation can be stored in the spreadsheets. LOTUS files are extremely inefficient in their use of memory; nevertheless, we have over 100 Canadian pollen cores on file on four 1.2 MB floppies. LOTUS files can be translated to be read by other programs such as dBASE and SASPC, and the LOTUS program can write normal ASCII files easily.
There are two readily available, specialized methods of data storage, but more generally there are several methods currently in use, including:

Matrix - where the columns are variables (pollen types) and the rows are cases (levels in a core, or different top samples). This is the method used to input data into the package statistical programs such as SAS, BMDP and SPSS-X.

Attribute - in this, all of the cases (spectra) for one variable (taxon) are listed, followed by the next variable, etc. This is used by, for example, Webb & Bartlein (FOSSIL files).

Variable/value - each variable for a level is listed in two parts, a 2-letter code (signifying the variable) and the value of the data point. This is used by Cushing (POLDATA), for example.

Special - this would include, for example, SPSS system files, which are readable only by SPSS. These types of files include not only the data, but also other information about the dataset. They are only readable by the specific program (e.g. SAS, SPSS) but make access to the data much easier. The provision for attached documentation is one of the strong points of these kinds of files. Most also include facilities to write data in ASCII for use by other programs.

Many of these file types can be translated directly by other programs (for example, SASPC reads DIF and DDF files), although this may involve some manipulation. This is one of the arguments for using industry standard software: such programs are more expensive, but may save some time down the road.

Matrix format, ASCII character is a good standard for exchange, as it is easily read, and useful in the greatest number of situations. In addition, the documentation, including the column and row headings, should always be included IN the files. This information can be edited out before use, but this ensures that the data are always documented.
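The matrix layout with the documentation kept IN the file can be sketched as follows. This is only an illustration under stated assumptions: the comma-separated layout, the "#" prefix for documentation lines, the site name, the taxa and the counts are all hypothetical and are not taken from any of the programs or file types named above. The sketch reads a documented matrix file and also transposes it into the attribute layout (all levels for one taxon, then the next taxon).

```python
import io

# Hypothetical matrix-format ASCII file: lines starting with '#' carry the
# documentation, the header row names the pollen types (columns), and each
# later row is one level (case). All names and values are invented.
SAMPLE = """\
# Example Lake (hypothetical site), raw pollen counts
# depth in cm; one row per level
depth,Picea,Pinus,Betula
0.0,12,40,8
10.0,30,20,10
"""

def read_matrix(stream):
    """Return (documentation lines, taxon names, depths, rows of counts)."""
    docs, taxa, depths, counts = [], [], [], []
    for line in stream:
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            # Documentation travels with the data but is skipped on analysis.
            docs.append(line.lstrip("# "))
            continue
        fields = line.split(",")
        if not taxa:
            # First non-documentation line is the column heading row.
            taxa = fields[1:]
            continue
        depths.append(float(fields[0]))
        counts.append([float(v) for v in fields[1:]])
    return docs, taxa, depths, counts

def to_attribute(taxa, counts):
    """Transpose the matrix layout into the attribute layout:
    one list of values (by level) per taxon."""
    return {name: [row[j] for row in counts] for j, name in enumerate(taxa)}

docs, taxa, depths, counts = read_matrix(io.StringIO(SAMPLE))
by_taxon = to_attribute(taxa, counts)
```

Keeping the headings and notes in the file costs a few lines of parsing, but the data can then never be separated from their description.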
In the [*p.3 / p.4*] old days, large files full of numbers but with no explanation abounded, and these are totally useless, as the documentation is now lost. Problems of this format include the order of variables, making the addition of new taxa difficult, and the need to include columns of 0's if there are no or only a few grains of a particular pollen type in a file.

A few minor points: data should always have the decimal point punched in the data, rather than using a FORMAT to include it. Also, the data should always be stored as raw counts. The statistics program can compute the percentages each time the file is accessed, and this is good practice. In this way, when the output is used in several years' time, exactly what has been done can be traced. It is much better to have one file with the data (or for example, one file for each lake), and always use these data, computing % or influx for the application, rather than having hundreds of temporary files.

SPECIFIC PROGRAMS
Owen K. Davis, Department of Geosciences, University of Arizona.

The programs we use at the University of Arizona are written in Basic and Turbo Pascal for the personal computer. All of them are in the public domain, and already have been traded with other palynologists. But first a WARNING. These programs are constantly being updated, primarily from Basic to Turbo Pascal. More importantly they are NOT BULLET-PROOF. The error-handling procedures are fairly robust in the Compiled Basic programs, but new "undocumented features", i.e., bugs, are discovered every time a new student uses the program.

We have two main programs, a spreadsheet for data entry and editing, and a plotting program that drives the Houston Instruments DMP-40 plotter.
Here are their main menus:

POLNSPSH.PAS (MAIN MENU):
Read Pollen Data Matrix
Change Names of Pollen Types
Edit Pollen Data
Create New Data Base
Write Pollen Data to Disk
Send Data to Printer
QUIT

during data entry:
(F1) = Help
(F2) = Enter Data by Columns
(F3) = Enter Data by Rows
(arrows) = Move Cell to Cell
(PgUp PgDn) = Screen Up, Down
(Ctrl RtAr) = Screen Right
(Ctrl LfAr) = Screen Left
(Ins) = Enter or Change Value
(Home) = Change spelling of Pollen Type
(Del) = Add/Delete Row/Column

The data are stored in ASCII files (easy to work with, but bulky) beginning with a one line description, the number of rows and columns (currently 80 x 100 max), then the data arranged by pollen type with each value separated by a .

POLPLOT.BAS (MAIN MENU):
(1) READ POLLEN DATA MATRIX
(2) SELECT POLLEN TYPES
(3) WRITE POLLEN DATA TO DISK
(4) LABEL DIAGRAM
(5) DRAW DEPTH AXIS
(6) DRAW CURVES
(7) ADVANCE PEN
(8) DRAW SEDIMENT COLUMN
(9) CLOSE FILES AND QUIT

There are several sub-menus.

[*p.4 / p.5*]

I have also written a series of support programs that use the POLNSPSH file type. Most of these are in Basic and will be rewritten using Borland's (R) Numerical Methods programs for Turbo Pascal.

POLUTIL.BAS: COMPLETE POLLEN SUMS CALCULATE CONC. OR INFLUX CALCULATE RATIOS DATA TRANSFORMATIONS SAVE DATA TO DISK

POLREGRIN.BAS: REGRESSION AND CORRELATION PLOT DATA ON SCREEN PLOT DATA ON DMP40 PLOTTER DATA TRANSFORMATIONS

POLCURV.BAS: COMPUTE B-SPLINE SMOOTHING FFT CONVOLVE, DECONVOLVE COMPUTE FFT SPECTRAL ANALYSIS POLYNOMIAL FIT

DIGITIZR.BAS runs a HI hi-pad digitizer. SET HORIZONTAL (Percentage) SCALE SET VERTICAL (Depth or Time) SCALE. ENTER POLLEN PERCENTAGES.

POLSORT.BAS sorts pollen types of POLNSPSH files.

MAP.BAS draws maps with the DMP-40.

PAK2DAT.PAS converts POLNSPSH files (*.PAK) to DECORANA or SAS files.

DISSIM.PAS calculates dissimilarity indices between selected types of two data sets and prints shaded diagrams of dissimilarity.
I also have Basic and Pascal versions of programs to calculate insolation at various latitudes through time: PISIAS.BAS and PISIAS.PAS along with a few programs to manipulate the files output by the programs.

RECENT PUBLICATIONS IN MULTIVARIATE DATA ANALYSIS AND PRACTICAL COMPUTING OF RELEVANCE TO PALAEOECOLOGISTS
H.J.B. Birks, Botanical Institute, University of Bergen, Norway.

In the last few years many books have appeared on multivariate data analysis written either for ecologists or applied statisticians and on practical aspects of computing. Several of these are of direct relevance and importance to palaeoecologists but because of their title or intended readership they might be missed by palaeoecologists. This would be unfortunate as several of these books are not only excellent but also are of great value to all palaeoecologists interested in quantitative analysis of their data using appropriate, robust, and powerful methods. I have found the following eight books to be particularly useful and relevant.

P.G.N. Digby and R.A. Kempton 1987 Multivariate analysis of ecological communities. Chapman and Hall, London and New York, 206 pp. ISBN 0-412-24640-6 (Hb), ISBN 0-412-24650-3 (Pb).

This is written by two associates of John Gower, one of the leading British applied statisticians based at Rothamsted. Gower has developed several important techniques such as principal co-ordinates analysis, skew-symmetry analysis, and Procrustes rotation. The book is, not surprisingly, particularly strong on recent, 'Rothamsted-Gower' approaches, particularly geometrical scaling or ordination methods, comparison of ordinations by Procrustes rotation, and analysis of asymmetric association matrices. It provides some of the first intermediate-level mathematical [*p.5 / p.6*] of biplots, Procrustes rotation, and skew-symmetry analysis.
It is rather weak on classification; for example it barely discusses two-way indicator species analysis and it does not mention sum-of-squares minimum-variance clustering at all. It is a very useful book that neatly complements A.D. Gordon's (1981) excellent Classification book (Chapman and Hall).

R.H.G. Jongman, C.J.F. ter Braak, and O.F.R. van Tongeren 1987 Data analysis in community and landscape ecology. Pudoc, Wageningen, 299 pp. ISBN 90-220-0908-4.

This is an excellent and extremely valuable book. Although written primarily for community and landscape ecologists, much of it is directly relevant to palaeoecologists. The largest chapter is on ordination (by Cajo ter Braak) in which he builds up from indirect gradient analysis methods of correspondence analysis, detrended correspondence analysis, principal components analysis, and biplots to the new direct gradient analysis techniques of canonical correspondence analysis and redundancy analysis. Important and ecologically critical distinctions are made throughout between linear and unimodal responses between organisms and their environment and emphasis throughout is on ecological realism, interpretation, and robustness. There are also excellent chapters on regression, with a clear and simple explanation of logit regression (regression for presence/absence data) and of generalized linear models, on calibration and response functions, and on the analysis of spatial data, with clear introductions to spatial autocorrelation, spatial semivariance, and kriging (a powerful spatial-interpolation technique). Other chapters concern data collection, classification, and case studies. The book provides a wonderfully stimulating introduction to many new and powerful methods of data analysis that are of direct applicability in palaeoecology. It is written for those "who want to understand better the methods they are using and are eager to learn new, more powerful methods". It is a must for any quantitative palaeoecologist.
I.T. Jolliffe 1986 Principal component analysis. Springer-Verlag, New York, 271 pp. ISBN 0-387-96269-7.

When I wrote my first principal components analysis program 16 years ago, I would never have guessed or believed that one day there would be a whole book devoted to PCA! PCA has become such a widely used (and misused) technique in so many disciplines that there is renewed interest in the theory and applications of PCA amongst applied statisticians. This book reflects this interest, as it is written by a statistician primarily for statisticians. It is a specialized text. Because of its strong mathematical form the book is unlikely to be read by those who should read it most unless they are prepared to take the trouble to understand the mathematics, whereas those who can easily follow the text will probably learn little from it. It contains new and useful material for palaeoecologists, including discussions of principal components in regression analysis, biplots, robust estimation, outlier detection, analysis of time series data, and analysis of closed, proportional data. Surprisingly it gives little discussion on how to interpret PCA results. The examples that are discussed tend to be interpreted solely in terms of contrasts between [*p.6 / p.7*] variables with high, extreme loadings. It is a valuable reference work for all who use PCA or related techniques.

M.J. Greenacre 1984 Theory and applications of correspondence analysis. Academic Press, London and Orlando, 364 pp. ISBN 0-12-299050-1.

The general technique of correspondence analysis has been frequently reinvented or rediscovered, and given different names such as dual or optimal scaling, reciprocal averaging, canonical analysis of contingency tables, and analyses of correspondences. The geometric approach of correspondence analysis has primarily been developed in France by Jean-Paul Benzecri and his school of French data-analysts.
Benzecri's work, although increasingly quoted, is poorly known amongst Anglo-American statisticians. In this book Michael Greenacre gives the first extensive exposition of the Benzecri approach to correspondence analysis. It is a particularly valuable book for all who have tried (unsuccessfully in my case) to understand Benzecri's texts but who want to understand the theory underlying widely used techniques such as reciprocal averaging or detrended correspondence analysis. Because the Benzecri school primarily developed their approach in the context of linguistics, Greenacre emphasizes the use of correspondence analysis to summarize data in contingency tables. He does not develop the unimodal/weighted averaging basis of correspondence analysis that is so important in ecology and palaeoecology and that ter Braak has developed so strongly and so effectively in Jongman et al. (1987). Greenacre's book is certainly useful, but its importance to palaeoecologists has largely been superseded by ter Braak's chapter in Jongman et al. (1987).

Pierre Legendre and Louis Legendre 1987 Developments in numerical ecology. Springer-Verlag, Berlin, 585 pp. ISBN 3-540-16086-8.

This massive book presents the proceedings of a NATO Advanced Research Workshop on Numerical Ecology held in France in June 1986. It comprises a series of invited lectures given primarily by applied statisticians or data analysts in related disciplines on new (sometimes very new, sometimes not so new) approaches to the analysis of ecological data, followed by a series of fascinating reports by working groups of ecologists and mathematicians on the possible application of these new approaches in six broad branches of ecology.
The new approaches discussed are scaling methods including non-linear scaling, unfolding techniques, and two- and three-way multidimensional scaling, fuzzy set clustering, constrained and conditional clusterings, fractal theory, path analysis, and spatial point patterns and spatial autocorrelation. The possible value (or otherwise!) of these novel approaches is critically considered for microbial, marine benthic, marine pelagic, limnology, terrestrial plant, and terrestrial animal ecology in the working group reports. Unfortunately palaeoecology is not considered. Some of the 'new' approaches are well established in palaeoecology (e.g. constrained clustering). Some have obvious, potential palaeoecological applications (e.g. fuzzy set clustering, spatial autocorrelation, constrained scalings and unfolding techniques, path analysis). Others such as fractal theory and various elaborations of multidimensional scaling appear to have little or no relevance to palaeoecologists (or to [*p.7 / p.8*] terrestrial animal ecologists if Dan Simberloff's working group report on Dirty data and clean questions is representative!). There is a lot of exciting and thoughtful material in this volume. Only time will tell whether the choice of new approaches made by the Legendre brothers will turn out to be useful to numerical ecologists. Certainly the volume is useful to quantitative palaeoecologists in highlighting novel techniques that we should think about and even try. J. Aitchison 1986 The statistical analysis of compositional data. Chapman and Hall, London and New York, 416 pp. ISBN 0-412-18060-4. In 1897 Karl Pearson showed the dangers of using percentages or proportions in many statistical analyses, commonly resulting in 'spurious correlations'. Despite contributions in the 1960's and 1970's from statisticians and quantitative geologists (e.g. J.E. Mosimann, F. Chayes, J.C. 
Butler) the problems of analyzing closed data have remained unresolved until a series of papers by John Aitchison appeared between 1981 and 1984. Now in this important book Aitchison has built on these papers to provide the first major contribution to many of the problems associated with closed data, induced spurious correlations, and constant-sum constraints. Aitchison's approach usually involves logarithms of ratios and extends from standard techniques of regression, principal components analysis and canonical correlation analysis to questions such as complete subcompositional independence, subcompositional invariance, and partition independence, all of which are unique to percentage data. Many standard multivariate techniques are not appropriate with closed data, but Aitchison provides a diverse and challenging armoury of modelling and statistical testing techniques appropriate solely for percentages. The book is clearly written for statisticians and Aitchison presents many proofs, properties, and definitions for his statistical colleagues. However, in view of the central importance of the whole book to quantitative palaeoecology, the time and effort involved in understanding the text are worthwhile. A series of BASIC programs called CODA for IBM PC-compatibles is available that solves the many examples given in the text. The programs are particularly useful in working through the book and helping to understand the new techniques. Aitchison emphasizes that his log-ratios, log-linear contrasts, and additive logistic normal distribution are not the last word on this topic (see, for example, Gower's paper in Legendre and Legendre 1987!), but modestly suggests that his approach can produce results that "surprise many geologists". We are all in for surprises when we read this book and begin to apply Aitchison's methods in quantitative palaeoecology!

R. Gittins 1985 Canonical analysis - A review with applications in ecology. Springer-Verlag, Berlin, 351 pp.
ISBN 3-540-13617-7.

This book, like Greenacre's on correspondence analysis and Jolliffe's on principal components analysis, provides an in-depth and very detailed review of one multivariate technique, namely canonical analysis, including canonical correlation analysis and canonical variates analysis (= multiple discriminant analysis). About one-third of the book deals with theory and mathematical relationships. One third presents a series of detailed [*p.8 / p.9*] ecological case studies. The remainder of the book concerns an assessment of canonical analysis and future developments. The book is primarily about canonical correlation analysis, a technique that has obvious appeal to ecologists as a means of studying relationships between two sets of multivariate data (e.g. vegetation and soils). It has, however, a crippling set of assumptions, in particular linear relationships between variables. As a result it has never been a particularly useful or appropriate technique in ecology. Within ecology and palaeoecology, canonical correlation analysis is now largely superseded by ter Braak's canonical correspondence analysis that assumes unimodal responses between biological and environmental variables (see Jongman et al. 1987). Although there is an enormous amount of information about multivariate data analysis in general in Gittins's book, its main concern, canonical correlation analysis, is today only really of theoretical interest to ecologists and palaeoecologists. Gittins's view that "canonical analysis exists primarily to be used" seems excessively pragmatic when we know that many of its assumptions are biologically unrealistic. It is surely better to refrain from using techniques that exist but are inappropriate than to use techniques simply because they exist. For many quantitative palaeoecologists Gittins's book is now largely replaced by Jongman et al.
(1987) where canonical correlation analysis is critically discussed and evaluated in the context of other canonical, constrained ordination techniques. Considerable time and effort are needed to read and understand Gittins's book. The time is perhaps better spent with Jongman et al. (1987) or Digby and Kempton (1987).

W.H. Press, B.P. Flannery, S.A. Teukolsky, and W.T. Vetterling 1986. Numerical recipes - the art of scientific computing. Cambridge University Press, Cambridge, 818 pp. ISBN 0-521-30811-9.

This is a wonderful and much needed book. It presents and discusses efficient and reliable algorithms and FORTRAN and PASCAL source listings for all the main aspects of numerical analysis - solution of linear algebraic equations and matrix manipulations; interpolation, extrapolation and splines; integration of functions; evaluation of functions; derivation of special functions such as factorials, gamma functions, etc.; random numbers; sorting; roots and non-linear equations; function optimization; eigenvalues and eigenvectors, Fourier transforms; basic statistics; modelling, integration of differential equations; partial differential equations; and two point boundary value problems. Its 200+ subroutines are available on 5 1/4" diskettes (FORTRAN ISBN 0-521-30957-3; PASCAL ISBN 0-521 309854-9) along with example books and programs (FORTRAN ISBN 0-521-31330-9 and ISBN 0-521-30958-1; PASCAL ISBN 0-521-30956-5 and ISBN 0-521-20955-7). The FORTRAN subroutines are in FORTRAN 77 and work without difficulty on an IBM AT computer with either the Professional FORTRAN 1.0 compiler or the Microsoft FORTRAN 3.3 compiler. I have no experience of the PASCAL versions. I know of no other book like this that covers so much material about difficult but important programming problems in a clear and concise way written for scientists rather than professional computer scientists. I only wish it had been available 10-15 years ago - many, many hours of frustration would have been saved!
Strongly recommended.

[*p.9 / p.10*]

MISCELLANEOUS NOTES

From: The Department of Biogeography and Geomorphology, Australian National University.

Storage, analysis and plotting of pollen data at the Canberra laboratory has been done with POLSTA, an interactive time series analysis package developed by Dr. David Green (Green 1983, Pollen et Spores 25:531-540). The feasibility of adapting POLSTA for use on personal computers is being explored, and comment on the desirability of such a step would be welcomed by: Gary Dolman, Biogeography and Geomorphology, Australian National University, GPO Box 4, Canberra ACT 2601.

From: R. Bonnefille, Marseille, France (Laboratoire de Geologie du Quaternaire, CNRS Luminy - Case 907, 13288 Marseille Cedex 9, FRANCE)

The current status of the Pollen Data Bank is: 478 modern pollen sites and 1241 fossil, from Ethiopia, Kenya, Burundi, Senegal, Togo, Congo, Tanzania and others, comprising 1719 levels and 1080 taxa. AZERTY@FRMOP11.

At the same laboratory, a G1PAL programme for analyzing pollen data (plotting, zoning, percentage and accumulate rate frequencies), C. Goeury and J. Guiot, CLIMAT@FRMOP11.

[ Email addresses (not reproduced) extend onto p. 11. ]

--- End of Newsletter 1, June 1988 ---