INQUA Working Group on Data-Handling Methods

Newsletter 9: January 1993

DATABASING THE WORLD

David G. Green
Centre for Information Science Research
Australian National University
GPO Box 4 Canberra 2601 AUSTRALIA
E-mail: david.green@anu.edu.au

Computers are forever challenging us with new ways of doing science. Now that computers are a familiar sight in the laboratory, the next challenge is to adapt to thousands of computers all joined together. Even by conservative estimates, thousands of institutions and perhaps millions of researchers are now served by Internet (Krol, 1992), a vast communications web that links together computers all around the world.

The services and information available on Internet are astounding. Access to world-wide electronic mail and to electronic newsgroups covering hundreds of topics is just the beginning. Being connected to Internet means having the resources of literally thousands of computers at your fingertips. Recognizing the advantages of free information exchange, many computer sites now allow guest logins by users over the network. What is more, they make available various data, software and services that can be freely copied or used. The following examples can only hint at the incredible range of information already available:

* On-line access to telephone directories, bibliographies and library catalogs in many parts of the world.

* Free software - many sites maintain libraries of public domain software. The Free Software Foundation at MIT develops and distributes high-quality, free software under its GNU Project.

* Molecular biology databases, software and bibliographies - the Australian National Genomic Information Service (ANGIS) at the University of Sydney maintains up-to-date copies of the major databases.

* Satellite and weather data - the University of New Mexico alone makes available 90 gigabytes worth!

* Geographic data - electronic atlases, census data and summaries such as the CIA World Databank and Factbook (maps, facts and figures about every country in the world).

* Electronic texts - Project Gutenberg, a public domain project, produces electronic versions of English language texts, ranging from Roget's Thesaurus and the Complete Works of Shakespeare to the CIA World Factbook and US Census.

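For several years now the basic means of accessing files across the network has been FTP ("File Transfer Protocol"). Network archives use the "anonymous ftp" convention, in which a guest logs in under the name "anonymous" and gives an electronic mail address as the password. For example: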

====================================================
ftp life.anu.edu.au       (logging in to the site LIFE at
                           the Australian National University)
Connected to life.anu.edu.au.
220 life FTP server (SunOS 4.1) ready.
Name (life.anu.edu.au:david):  anonymous
331 Guest login ok, send ident as password.
Password:         (give your electronic mail address)
230 Guest login ok, access restrictions apply.
ftp> ls               (gives you a directory listing)
ftp> cd /pub/biomathematics 
           (changes directory to /pub/biomathematics)
ftp> bin                     (changes mode to binary)
ftp> get polsta.zip   (retrieves the file polsta.zip)
===================================================
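The same session can be scripted. The following is a minimal sketch in Python using the standard ftplib module; the function names are hypothetical, and the host, directory and file name in the usage note are simply those from the session above and are assumed to be reachable.

```python
from ftplib import FTP


def fetch_binary(ftp, directory, filename, dest_path):
    """Change to `directory` and download `filename` in binary mode."""
    ftp.cwd(directory)  # equivalent to "cd" in the session above
    with open(dest_path, "wb") as out:
        # retrbinary transfers in binary mode, like "bin" followed by "get"
        ftp.retrbinary("RETR " + filename, out.write)
    return dest_path


def anonymous_fetch(host, directory, filename, dest_path, email):
    """Log in as "anonymous", giving an e-mail address as the password."""
    with FTP(host) as ftp:
        ftp.login(user="anonymous", passwd=email)
        return fetch_binary(ftp, directory, filename, dest_path)
```

A call such as anonymous_fetch("life.anu.edu.au", "/pub/biomathematics", "polsta.zip", "polsta.zip", "you@your.site") would repeat the transfer shown above, assuming the archive still accepts guest logins.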
The number of network archives has grown rapidly, so that finding information, or even knowing what is available, among the thousands of sites is extremely time-consuming. ARCHIE addresses this problem by providing a database of the contents of all known sites. These databases are provided at several major sites, such as archie.au (Australia), archie.funet.fi (Finland) and archie.mcgill.ca (Canada). They can be queried either by logging in directly via Telnet (using the name "archie") or by electronic mail (e.g. to archie@archie.au), with the message consisting of keywords (e.g. "help"). For example:
====================================================
telnet archie.au     (connecting to the local archie
                                             server)
Trying 139.130.4.6 ...
Connected to archie.au.
Escape character is '^]'.

SunOS UNIX (plaza.aarnet.EDU.AU)

login: archie  (log in name is "archie", no password)

YOU ARE RUNNING ON ARCHIE.AU (sometimes known as
                                 plaza.aarnet.EDU.AU)

If you have any problems with archie,
   send mail to ccw@archie.au

This machine is a brand spanking new SparcStation 2 purchased by AARNet
funds to further serve the AARNet community.  The machine lives directly on
the AARNet backbone so should provide excellent connectivity to all points
of AARNet.

archie> help                (asking for help) 

Help gives you information about various topics, including all the commands
that are available and how to use them. ... etc.

archie> quit                (finishing a session)
====================================================
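The electronic mail route can be sketched in the same way. The example below only composes the request, using Python's standard email package; the function name is hypothetical, and the default server address is the one quoted above. Actually sending the message would need a mail relay, which is not shown here.

```python
from email.message import EmailMessage


def build_archie_request(sender, command="help", server="archie@archie.au"):
    """Compose a mail query whose body is a single archie keyword."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = server
    # The body carries the keyword(s), e.g. "help"
    msg.set_content(command)
    return msg
```

The returned message could then be handed to smtplib.SMTP(...).send_message(msg), assuming a local mail host is available.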
User-friendly interfaces, such as XARCHIE (Fig. 1), now make it possible to locate and retrieve files at the touch of a button.
Figure 1
Recently, several other protocols have appeared that allow a more systematic approach to searching the network. They include WAIS (Wide Area Information Servers), World Wide Web and Gopher. In Gopher, which has spread the fastest, the user retrieves files by selecting from a menu. Since the menu normally includes links to other Gopher servers, it is possible to hop from site to site. The recent introduction of an indexing system (Veronica) means that users can create and use customized menus "on the fly" (Fig. 2).
Figure 2
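To illustrate how a Gopher menu carries links to other servers, here is a small sketch that parses a menu listing. It assumes the usual Gopher line layout (a one-character item type, then a title, selector, host and port separated by tabs, with a lone "." ending the listing), which is not described in this article; the class and function names and the hosts in the sample are hypothetical.

```python
from typing import List, NamedTuple


class MenuItem(NamedTuple):
    item_type: str  # "0" = text file, "1" = another menu
    title: str
    selector: str
    host: str
    port: int


def parse_menu(listing: str) -> List[MenuItem]:
    """Turn a raw Gopher menu listing into a list of MenuItem records."""
    items = []
    for line in listing.splitlines():
        if line == ".":  # a lone full stop ends the listing
            break
        if not line:
            continue
        parts = line[1:].split("\t")
        if len(parts) < 4:  # skip malformed lines
            continue
        title, selector, host, port = parts[:4]
        items.append(MenuItem(line[0], title, selector, host, int(port)))
    return items


def remote_hosts(items: List[MenuItem], here: str) -> List[str]:
    """List the other servers that a menu links to."""
    return sorted({item.host for item in items if item.host != here})


# Hypothetical menu: two local entries and one link to another site.
SAMPLE_MENU = (
    "0About this server\t/about.txt\tgopher.example.org\t70\r\n"
    "1Pollen data\t/pollen\tgopher.example.org\t70\r\n"
    "1Another site\t\tother.example.net\t70\r\n"
    ".\r\n"
)
```

Calling remote_hosts(parse_menu(SAMPLE_MENU), "gopher.example.org") picks out the entry that leads to a different server, which is what makes hopping from site to site possible.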
Network publication. With publication delays often running into years, researchers are increasingly turning to Internet to distribute their results quickly. Furthermore the sheer number of journals means that published work is often missed by other researchers. Electronic collections of papers and references provide a way to communicate research results and innovations.

Network publications (e.g. electronic journals) need not be limited to the text and figures of traditional paper publications. Other material can include bibliographies, databases, and software. For instance, the fastest and simplest way to distribute software is to make it freely available on Internet.

At present the main drawback to electronic publication on Internet is lack of formal recognition. However, librarians, publishers and site managers are now working on such issues as registering electronic publications and establishing repositories for electronic publications.

Coordinating research. Perhaps the most profound effect of Internet on science has been to usher in an era of cooperative science on a scale never seen before. In some areas of research, notably molecular biology, distribution of information over Internet has grown explosively. The most visible result is the appearance of international, public-domain databases such as Genbank and EMBL. As these databases become ever more enormous, working scientists are coming to rely on them as sources of reference. In molecular biology it is already standard practice to compare newly derived sequences against existing ones in the major databanks. Many journals (e.g. Nature) now demand that results be submitted to one of these network databases as a precondition for publication.

Contributing to network databases makes both economic and scientific sense. We cannot afford the luxury of carrying out research in piecemeal fashion. Given limited resources, it is essential to make maximum use of every piece of available data. Data that is used only once is like a disposable soft drink bottle - good things come out of it, but thereafter it is junk. The archives of the world's institutions are full of this refuse from uncoordinated research. Ideally, the results of every piece of research should not only answer an immediate question, but also contribute data to a larger scientific jigsaw. Many topical issues, such as biodiversity, are crying out for cooperative databases to support both research and decision-making.

Cooperative databases convey many advantages. Previously difficult studies become easy. Completely new kinds of study become possible, and there is a significant serendipity effect that emerges as data are combined in new ways. For instance, comparative studies of molecular databases have already yielded new insights about gene families and the mechanisms of evolution. There is every reason to expect that databases in other fields of biology will prove equally fruitful.

Potential network projects in Quaternary science. There are several possible kinds of public domain databases that could be set up on the network to serve Quaternary studies. They include

- compilations of useful software;
- electronic databases of scientists working in the field;
- pollen identification keys (including images);
- annotated bibliographies of relevant publications;
- abstracts of recent publications;
- compilations of data (e.g. complete pollen site records).

Each of the above kinds of database would contribute materially to existing Quaternary research projects.

Public domain databases usually conform to IAFA (Internet Anonymous FTP Archive) standards. They are normally characterized by the following features:

COORDINATION - There is a controlling agency or organization that manages the database, receives and processes new entries, and communicates relevant news to its users.

PARTICIPATION - Anyone may contribute data to the database. Major databases announce new entries via special newsgroups or mailing lists.

ACCESS - Anyone may access, copy or use the database at any time. Normally access is via a computing network using a standard protocol.

STANDARDS - Contributors must use standard fields and attributes in submissions (e.g. Croft, 1989). This standard must be well-defined and should be publicized as widely as possible (see below). Usually it is expressed as a submission form (electronic, printed, or both) that is filled in by contributors.

FORMAT - Textual data (including bibliographies, mailing lists etc.) are normally submitted and stored as ASCII files in tagged field format (see Appendix). The database may be compressed, using standard utilities, to simplify network transfer. Images should be in one of the common formats in use, such as GIF (Graphics Interchange Format).

QUALITY CONTROL - Users need some guarantee that data provided in a database are both valid and accurate (Green, 1991, 1992). Quality control checks can be applied by database contributors, coordinators, or users - preferably all three.

ACKNOWLEDGEMENT - Every entry should include an acknowledgement of its contributor. This is essential to the notion that contributions are a form of publication.

AGREEMENTS - There should be an explicit list of terms and conditions that contributors and users must agree to. Notably, users agree to acknowledge the project and to waive liability for any use they make of the data. Contributors agree to place their data in the public domain.

LIFE at the Australian National University. The Australian National University Bioinformatics Facility provides a wide range of biological information and software through its Internet anonymous FTP archive:

site        life.anu.edu.au
login       anonymous
password    (your email address)
directory   /pub and its subdirectories
Current topics include biodiversity, bioinformation, complex systems, landscape ecology, molecular biology, and neurophysiology. For instance, we are developing prototype network information systems and protocols for the International Organization for Plant Information (IOPI), which aims to document the distributions of the world's plants.

For pollen analysts, the program POLSTA (/pub/biomathematics/polsta.zip), described in previous issues of this newsletter, is freely available. It is an interactive PC package that provides tools for analysis and modelling of pollen time series.

References.

Deutsch, P. (1992). Publishing Information on the Internet with Anonymous FTP. IAFA DOC II.

Green, D.G. (in prep.). Databasing diversity - a distributed, public-domain approach. In preparation.

Krol, E. (1992). The Whole Internet. O'Reilly and Associates.

APPENDIX

% ---------- < START : Cut here > ----------
   #####
% Part 1 CONTACT REGISTRATION 
% 
% Please complete this registration form about
%   the source of this dataset.
% This information is needed for the following
%   reasons:
%    - identifying who contributed the dataset
%    - identifying who produced and/or
%        maintains this dataset
%    - telling users whom to contact regarding
%        this dataset
%    - linking together information from the
%        same source
% 

SOURCE    Name of person or organization who
           produced the dataset
CONTACT   Name of person or organization to
           contact about the dataset
EMAIL     Electronic mail address for queries
           about the dataset
ADDRESS   Postal address for correspondence
           about this dataset
PHONE     International telephone number
FAX       International fax number

% Part 2 DATASET REGISTRATION 
%
% Please complete this registration form about
%   the dataset.  This information is crucial
%   for the following reasons:
%    - Demonstrating the validity of the data
%    - Defining the methodology & data lineage
%       for future users
%    - Identifying this study for all records

TITLE    Give a short descriptive name for the
          database. 
DATE     When was the dataset last revised?
PURPOSE  Why and how were the data collected?
SOURCES  For compilations, indicate the
          original datasets
COMPILER Who is responsible for compiling and
          maintaining the data?
STANDARD What standard format (if any) does
          the dataset conform to?
          (e.g. Genbank entry)
PROGRAMS Name any special software used to
          read/manipulate the dataset.
REFERENCES Give details of relevant
          publications (e.g. methods, uses)
  AUTHOR       Name(s) of the author(s) 
  TITLE        Name of the book or article
  PUBLICATION  Details of book, journal,
                publisher, volume & pages. 
VALIDATION     What checks were applied to
                ensure that data are correct? 
ASSOCIATIONS   Name any other related data
                sets. (e.g. t001.dat etc)
COMMENTS       Mention any important issues
                not covered above. 

% Part 3 Methodology - repeat as many times as
%  necessary

TAXON    Start description of methods
CODE     Taxon code to use in the data records
FAMILY         
GENUS         
SPECIES               

% Part 4 DATA RECORDS - repeat as many times
%  as necessary

RECORD         Start of new record
DATE         
TAXA           List of taxon codes
SITES          Landscape units for this record

   #####

% ---------- < STOP : Cut here > ----------
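
A completed form in this tagged field format is straightforward to read by program. The following is a rough sketch only: it assumes that a field starts on any line whose first word is entirely in capitals, that other non-blank lines continue the previous field, and that lines starting with "%" or made up of "#" are to be ignored. The function name and the sample entry are hypothetical.

```python
import re

# A tag line: optional indent, an all-capitals tag, then an optional value.
TAG_LINE = re.compile(r"^\s*([A-Z]+)(?:\s+(.*))?$")


def parse_tagged(text):
    """Parse tagged-field text into a list of (tag, value) pairs.

    A list is used rather than a dictionary because tags such as
    DATE and TITLE may occur more than once in a submission.
    """
    fields = []
    for raw in text.splitlines():
        line = raw.rstrip()
        stripped = line.strip()
        # Skip blank lines, '%' comment lines and '#####' separators.
        if not stripped or stripped.startswith("%") or set(stripped) == {"#"}:
            continue
        match = TAG_LINE.match(line)
        if match:
            fields.append([match.group(1), (match.group(2) or "").strip()])
        elif fields:
            # Continuation line: append to the value of the previous field.
            fields[-1][1] = (fields[-1][1] + " " + stripped).strip()
    return [tuple(field) for field in fields]


# Hypothetical completed entry for Part 1 of the form.
SAMPLE = """\
% Part 1 CONTACT REGISTRATION
SOURCE    A. Palynologist
EMAIL     pollen@example.org
ADDRESS   1 Example Street,
           Exampletown
"""
```

Note that under these assumptions a continuation line that happens to begin with a word in capitals would be mistaken for a new field, so a working standard would need to state the rule explicitly.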

Copyright © 1993 David G. Green
WWW pages by K.D. Bennett