INQUA Working Group on Data-Handling Methods

Newsletter 1: June 1988

POLLEN DATA STORAGE AND ANALYSIS ON MICROCOMPUTERS

K. Gajewski, Scarborough College, University of Toronto.

Microcomputers have totally changed the way that data analysis is done. A wide variety of statistical, graphics and database software is readily available. For most applications, microcomputers have enough memory and power to perform statistical analysis as well as to store large amounts of paleoenvironmental data. Best of all is the easy and interactive access to your data. The cost is the large amount of time needed to learn the various applications, as well as the expense of hardware and software.

This note offers some thoughts about the storage of data on microcomputers and the exchange of data. A few comments about data archiving are based on experience creating a database of Canadian pollen diagrams.

Data exchange
There are two standard microcomputer systems: "IBM (DOS)" and "APPLE". Disks in one of these two formats are now the easiest way to exchange data. Unless there are literally millions of numbers, there is little reason to use mainframe tapes (exceptions include remote sensing data and some climate data sets, for example) as these are notoriously difficult to use and to transport between different systems. Of these two microcomputer standards, IBM is the largest, and most applications (statistical, database, etc.) are done on these. However, the APPLE standard always seems to come up with useful applications.

The other way to send data, especially if the files are not too long, is through computer mail, namely BITNET (USA), NETNORTH (CANADA), EARNET (CONTINENTAL EUROPE), and JANET (UK). The first three are usually considered as one system, and there is rarely any problem communicating between them. JANET is a bit more troublesome (at least in our experience at Scarborough), but again, it works well once the initial contact is made and the correct name (account) and address (node) are entered into your NAMES file. These systems are used to send data files as well as letters, and they are a rapid and inexpensive way to communicate. See Nature 328:752-753 (27 Aug. 1987) for more information about the system.

To use these systems, it is necessary to have an account with the local mainframe computer that supports the system (most universities in North America do). Details vary depending on the local system, but it involves having an account name (your own) and node (the University, e.g. RITCHIE@UTORONTO) which you send to your colleagues. Incoming mail appears in your MAILLIST, where it can be read or copied into your FILELIST. Incoming files (data, as opposed to letters) appear in your READERLIST, from which they can be RECEIVED into your FILELIST and downloaded to your microcomputer.

Data storage
There are several possibilities for the storage of pollen (or other paleoenvironmental) data. Although there is a lot of pollen data around, efficient data storage is not the primary consideration. More important are ease of access, error-free files, and good documentation. The type of storage system also depends on the use to which the data will be put and the resources (money and programmers) available. Commercial programs for the storage and analysis of data are now available and, although expensive, provide a means to store data of various formats. Two useful types of commercial programs are database systems and spreadsheets, and the two most common of these are dBASE and LOTUS.

dBASE is a relational database system designed to store and organize data. Each record of the file (sample) includes information such as location, date (YBP), sediment type, ... and all of the identified fossils. The particular data needed for an application can be extracted and used by your program. Strengths include a powerful system complete with a programming language, a large number of books and seminars that explain the system, and a standard file type (DBF). It is relatively hard to learn, and the program has no graphics or analysis capabilities.
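As a rough illustration of the record-per-sample idea, the sketch below models one sample as a Python record and extracts the data needed for one application. It is not dBASE code and does not reproduce any DBF schema: the field names, the `extract` helper, the site name and all values are invented placeholders.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class PollenSample:
    """One record (sample). Field names are illustrative assumptions."""
    site: str
    latitude: float
    longitude: float
    age_ybp: int                 # age in years before present
    sediment_type: str
    counts: Dict[str, int] = field(default_factory=dict)  # taxon -> raw count


def extract(samples: List[PollenSample], taxon: str,
            max_age_ybp: int) -> List[Tuple[int, int]]:
    """Return (age, raw count) pairs for one taxon, for samples no older
    than max_age_ybp. A taxon absent from a sample counts as zero."""
    return [(s.age_ybp, s.counts.get(taxon, 0))
            for s in samples if s.age_ybp <= max_age_ybp]


# Invented placeholder records, not real data.
samples = [
    PollenSample("Hypothetical Lake", 45.0, -75.0, 500, "gyttja",
                 {"Pinus": 120, "Betula": 40}),
    PollenSample("Hypothetical Lake", 45.0, -75.0, 2500, "gyttja",
                 {"Pinus": 80, "Picea": 60}),
]

selected = extract(samples, "Betula", 3000)
```

Storing the counts as a taxon-to-count mapping means a taxon that was not found simply has no entry, which is one way a record-based store can avoid fixed columns.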

LOTUS is much easier to learn, and there are also many books and seminars explaining the program. A spreadsheet basically stores the data as a matrix of "cells", and it is easy to enter and modify data, rearrange columns and rows, compute percentages and influx, produce chronology files, and best of all, see the data. There are some rudimentary statistics and graphics for preliminary data checking. In addition to the actual data, labels and documentation can be stored in the spreadsheets. LOTUS files are extremely inefficient of memory; nevertheless, we have over 100 Canadian pollen cores on file on four 1.2 MB floppies. LOTUS files can be translated to be read by other programs such as dBASE and SASPC, and the LOTUS program can write normal ASCII files easily.

There are two readily available, specialized methods of data storage, but more generally there are several methods currently in use, including:

Matrix format, ASCII character, is a good standard for exchange, as it is easily read and useful in the greatest number of situations. In addition, the documentation, including the column and row headings, should always be included IN the files. This information can be edited out before use, but this ensures that the data are always documented. In the old days, large files full of numbers but with no explanation abounded, and these are totally useless, as the documentation is now lost. Problems of this format include the fixed order of variables, which makes the addition of new taxa difficult, and the need to include columns of 0's if there are no or only a few grains of a particular pollen type in a file.
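A minimal sketch of reading a matrix-format ASCII file that carries its own documentation is given here in Python. The layout is an assumption made for illustration only: documentation lines start with "#", the first remaining line holds the column headings, and every later line is one sample. The `read_matrix` name, the site name and the numbers are invented.

```python
import io

# Invented example file; the "#" convention is an assumption of this sketch.
EXAMPLE = """\
# Hypothetical Lake, raw pollen counts (invented values for illustration)
# Rows are samples; columns are depth (cm) followed by taxa
depth Pinus Betula Picea
10.0 120. 40. 0.
20.0 80. 0. 60.
"""


def read_matrix(handle):
    """Split a documented matrix file into documentation lines,
    column headings, and numeric rows."""
    doc, header, rows = [], None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith("#"):
            doc.append(line.lstrip("# "))  # documentation travels with the data
        elif header is None:
            header = line.split()          # first non-comment line: headings
        else:
            rows.append([float(v) for v in line.split()])
    return doc, header, rows


doc, header, rows = read_matrix(io.StringIO(EXAMPLE))
```

Because the documentation lines are marked, a program can skip them rather than requiring that they be edited out by hand before each use.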

A few minor points: data should always have the decimal point punched in the data, rather than using a FORMAT to include it. Also, the data should always be stored as raw counts. The statistics program can compute the percentages each time the file is accessed, and this is good practice. In this way, when the output is used in several years' time, exactly what has been done can be traced. It is much better to have one file with the data (or, for example, one file for each lake), and always use these data, computing % or influx for the application, rather than having hundreds of temporary files.
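The practice of keeping only raw counts and deriving percentages on each access can be sketched as follows in Python. The `percentages` function and the counts are invented for illustration, and the pollen sum is assumed here to be the total of all taxa in the sample; in practice the taxa included in the sum depend on the application.

```python
def percentages(counts):
    """Compute percentages from raw counts without altering the stored data.

    Assumes the pollen sum is the total of every taxon in the sample.
    """
    total = sum(counts.values())
    if total == 0:
        # An empty sample has no meaningful percentages; report zeros.
        return {taxon: 0.0 for taxon in counts}
    return {taxon: 100.0 * n / total for taxon, n in counts.items()}


# Invented raw counts for one sample; these are what would be stored.
raw = {"Pinus": 120, "Betula": 40, "Picea": 40}

# Percentages are derived when needed and never written back to the file.
pct = percentages(raw)
```

Since the raw counts are left untouched, a different pollen sum can be applied later without any loss of information.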


Copyright © 1988 K. Gajewski
WWW pages by K.D. Bennett