Information Handling in a Large Information System
P. R. P. CLARIDGE
Low Temperature Station for Research in Biochemistry and Biophysics, Department of Scientific and Industrial Research, University of Cambridge, Cambridge, England. Crown Copyright Reserved 18th March 1958
The work described¹ in this paper was undertaken in response to a need felt at the Low Temperature Research Station for a collection of information on chemical compounds found in edible plants. Enquiry showed that there would be widespread interest in a collection of this type. That the need was generally felt was confirmed by a reference made to the lack of collected information by the Development Commission (a).² The last comprehensive publications of this type (b), (c) were published about 25 years ago, and many of the later publications contain little information of this kind.
Within the last few years, newer methods of analysis, such as chromatography, have facilitated the separation and identification of individual chemical compounds. In consequence, the volume of data being published is increasing rapidly. In 1955, for example, about 2000 papers reporting definite chemical compounds (as distinct from more indefinite substances such as starch) in flowering plants were abstracted in Chemical Abstracts and Biological Abstracts.
Some of the papers [e.g., (d)] contain data for a number of plants. On the basis of this sample it was estimated that there would be of the order of 7 million entries (one chemical compound in one plant) in a collection of all the published data on the subject and that the annual rate of addition would exceed 100,000. If the definition of “plant” were widened to cover the whole vegetable kingdom and to include the lower plants (e.g., bacteria, yeasts, fungi), the collection would be very much larger than this. It was considered desirable to organize the collection so that this extension would be possible.
The collection was required not only to provide information on named compounds in named plants, but also to support various generic searches, such as for all compounds having specified groupings in common, and searches on partial specifications. Correlations between factors, such as that between chemical composition and palatability, were also important. For this purpose the type of indexing to be used would have to be very detailed and constructed in such a way that the maximum amount of information could be displayed on the relations between the entries. A list was made of the various headings under which it was thought information should be indexed, and this was tested on a sample of approximately 2000 entries. The headings were revised (Fig. 1) on the basis of this trial. Subsequent smaller trials have suggested further improvements. There are other items which could advantageously be recorded even though they are not used for indexing the entry, such as language of original paper and type of study (e.g., experimental, comparative). A list of headings including such extra entries and a more adequate selection of types of source has been prepared for use in a larger scale trial of the whole system.
Economic retrieval of all the relevant information in the collection, in answer to an enquiry, was an essential requirement. The ability to select entries showing relationships not suspected at the time of entry of the data was desirable. No system can give out more information than has been put into it, so that the indexing system had to be such that as many implicit relationships as possible would be included when an entry was made. With these considerations in mind, a survey was made of possible indexing systems.
An alphabetical arrangement using plain language entries obviates the need for code books and provides implicit relationships through related meanings of the words used. Urquhart has shown (e), however, that information in alphabetical subject indexes can become “lost” in the sense that it is not retrieved by a searcher using the subject headings and guides. In his study more of the references looked for could be found by means of the author indexes than were retrieved by a subject approach. In this collection, not only would there be a large number of items to be indexed, but each of these items would have to be indexed from many aspects and at a number of levels of generality. Even the Index to Chemical Abstracts does not attempt to provide this last facility to any great extent; it is often necessary, particularly for non-chemical entries, to look up each member of the class when making a generic search. In an alphabetical subject index it is also difficult to search for information demanded under a partial specification, and it was concluded that an alphabetical subject index to a collection of this nature and size was impracticable.
The need for generic searches suggested a classified system. Any form of generic relationship can be taken as the basis of the classification, but once this is chosen, generic searches on other bases become impossible.
In the collection, there are four main aspects: botanical, chemical, functional (e.g., palatability), and miscellaneous (e.g., cultivation). Each of these could be indexed by its own classification. For the functional and miscellaneous aspects any classification would be arbitrary, but the number of headings is small enough for this to be unimportant. There would be no need to develop botanical and chemical classifications, since well-established classifications exist. If a classified system were to be used, however, the freedom to make generic searches by any criterion, which was a main requirement of the collection, would be lost, and it was concluded that an alphabetical, classified system would not be suitable.
Alphabetical systems not being suitable, the entries must be encoded in some way. This would facilitate machine handling which the size of the collection also indicated might be desirable.
In choosing a suitable notation³ in which to express the entries, the need for showing implicit relationships was kept well in mind. In entries made in plain language, the relationships between the words of the entry are expressed partly by special words of relation, partly by the order in which they are recorded, and partly by inflexion (i.e., special modifications) of the words used. If the “words” are reduced to single symbols, the last of these methods becomes synonymous with the use of a special “word” of relation. It should be possible to construct an artificial language, or notation, in which synonyms are rigorously excluded and in which the redundancy is reduced to a controlled amount. A separate sequence of symbols, or “word,” will be required for each concept to be expressed and for each relationship between the concepts. The idea of expressing relationships in this way has been proposed by Farradane (o), who employs special symbols for the purpose, and it is also included in the colon classification (p). If every “word” is to be expressed by a single digit, an impracticably large number of characters will be required. “Words” consisting of two or more characters must be made so as to reduce the number of characters to a usable level. Reduction in number of characters has to be paid for by the complication that the meaning of a character has become dependent on its context. To reduce this complication to a minimum, the number of characters used should be as large as possible. The characters used should be easily distinguishable and reproducible and be adaptable to manuscript writing. The range of characters found on a typewriter keyboard meets these requirements. Customarily these comprise upper and lower case alphabets, one range of numerals, a set of punctuation marks, and some special characters. A more suitable modification for chemical use is a set comprising upper and lower case alphabets, numerals, subscript numerals, and punctuation marks. If the typewriter is fitted with accents, accented characters increase the range available, although these might be considered as double characters since two key movements are required to reproduce them.
The subject to be indexed would be described in this notation, element by element, by taking the elements in a standard order. The resulting cipher would be a unique representation of the subject and the same subject would always be represented by the same cipher. Common sequences in any two ciphers would represent common elements and relationships. However, the position of these sequences in the ciphers will in general not be the same, owing to the “grammatical rules.” The sequences will also owe their individuality to the order of the characters of which they are composed. In consequence, any system used for retrieval must be able to recognise that, for example, “bad” is not equivalent to “dab.”
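The distinction drawn here between ordered sequences and mere collections of characters can be illustrated with a minimal sketch. The ciphers below are invented strings, not entries in any real notation, and the function names are hypothetical. The first test looks for the wanted sequence at any position in the cipher and respects character order; the second ignores order, as a retrieval system unable to tell “bad” from “dab” would.

```python
from collections import Counter


def contains_sequence(cipher: str, wanted: str) -> bool:
    """Order-sensitive test: `wanted` must occur as a contiguous run,
    at any position in the cipher."""
    return wanted in cipher


def contains_characters(cipher: str, wanted: str) -> bool:
    """Order-blind test: every character of `wanted` merely has to be present."""
    have = Counter(cipher)
    need = Counter(wanted)
    return all(have[ch] >= n for ch, n in need.items())


# Invented ciphers for illustration only.
ciphers = ["xxbadyy", "dabzz", "qbadq"]

ordered_hits = [c for c in ciphers if contains_sequence(c, "bad")]
unordered_hits = [c for c in ciphers if contains_characters(c, "bad")]
```

Under these assumptions the order-blind test also retrieves "dabzz", which is the kind of false selection the text says a retrieval system must avoid.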
The problem of depicting chemical compounds so that they might be indexed properly and uniquely has been under study by a number of workers during the last decade. In 1949 the International Union of Pure and Applied Chemistry invited submission of codes for chemical compounds which satisfied their requirements (z). Of the systems submitted in response to this invitation, the Dyson system (aa) has been selected after extensive testing as the Proposed International System (bb). This notation uses a large number of characters (162 in all) and in general satisfies the desiderata set out above for an indexing notation. Some of these characters (e.g., overlined characters and fractional subscripts) could perhaps be dispensed with for normal indexing, and if this is done, the notation can be expressed in approximately 107 characters.
No similar notation has been developed for plants. These have been traditionally classified in a linear order according to their main features. There are a number of anomalies in the order, and constant changes are being made in an effort to minimise these. For the flowering plants, two classifications (f, g) have found general acceptance. Another (h) based on a somewhat different starting point has been proposed more recently. Sporne (l, m) has suggested an alternative system based on the probable evolution of the plants. Two proposals have been made to describe plants by a fixed serial number (j, k). For a study of the relationship between chemical composition and the taxonomic characters of plants, it would be necessary to express these characters in a notation of the type proposed above. The outline of such a notation has been developed and this will be developed further when the problems of machine handling have been brought nearer solution.
A number of the indexing and classification systems which have been proposed as solutions to the problem of documenting large systems for information retrieval were examined to see whether they could be employed in this collection. All the coordinate systems, for example, are unsuitable since their principle of operation is conjunction of a number of headings of equal rank. In so doing, all order of the elements is destroyed.
Punched card machines of the conventional type also are not suitable for use with such notations for two reasons. First, they can be adapted to read more than 35–40 symbols only at considerable cost. Secondly, they normally read the cards broadside on, and it is necessary to specify the column in which a given code appears or to search the cards column by column. This repeated operation seriously reduces the rate of searching. What is required is a machine which reads each code in turn and selects only those items which contain the desired sequence. Such a machine was made for the purpose of handling Dyson’s chemical notation (n) and this was demonstrated by IBM in 1950. Unfortunately this prototype has not been developed further.
High operating speeds and great flexibility in operation are the characteristics which distinguish electronic computers. A computer is able to handle any type of notation and can be set to search for a complete specification in one pass. In addition, the calculating facilities built in enable a computer to work to alternative specifications in a way in which no other system can. Disadvantages of computers for information retrieval are their great cost and complexity and the large amount of detailed programming necessary before inserting or retrieving information. Also, in many machines input and output speeds are low in comparison with those of the calculating units. The basic needs of an information system are for a large store of data on which relatively little work will be done and for an output speed comparable with the rate at which the data searching is done, whereas the computer is most efficient when performing a number of sequential operations on the same data. A collection of the size and nature described above was likely to exceed the storage capacity of any computers existing when the survey was begun.
Microfilm rapid selectors, that is, electronic selectors in which the selection media are in microfilm form, have the advantage that they can produce the information corresponding to the search specification rather than the reference numbers of the documents which contain it. The film media are small and comparatively cheap to make and to store and are readily replicated. It seemed that further investigation of their possibilities was merited. A number of systems have been described. The Shaw rapid selector (u) redesigned and tested at the U.S. Department of Agriculture Library (t) was one of the earliest. The photoscopic information storage system (q) makes use of computer type circuits to analyse the information in the system and has the greatest density of information in its storage. It is, however, difficult as yet to amend information once it has been recorded. The Minicard system (r, s) has a large coding area, and special attention has been given to reproduction of copies, addition of codes to later prints and to the inclusion of the maximum amount of information in clear. The system also includes the fullest range of handling machines and represents the most comprehensive attempt to provide for the needs of an information system. Unfortunately it appears (s) that the codes are not to be read from end to end of the card on each pass but row by row as is done in the case of IBM cards. Another selector is in course of development at Western Reserve University (v). Unfortunately, none of these systems is currently available in the United Kingdom, and it was impracticable to base a system on any of them. There was, however, one machine of this type, the Filmorex (w, x), which is of moderate price and is currently available. This system was thought worthy of further investigation. It was accordingly chosen for trial.
One of the problems which arises whenever non-redundant codes or notations are used is that of detecting and eliminating errors. If the data are reproduced mechanically once they have been recorded and checked, errors in transcription are minimised. This can be suitably done by recording the information on punched paper tape. The Flexowriter automatic typewriter (y), which in its simplest form is an electric typewriter with a tape-punch and tape-reader attached to it, enables this to be done. This model can punch selected portions of the information onto the paper tape as it is typed. The resulting tape can afterwards be used to operate the typewriter. The codes can be punched in a strip along one edge of a card, by a modified version of the punch, and these cards used instead of the tape to operate the reader. There is also a more complex model of the machine (the Programatic). In this model, codes can be punched in the tape which will cause the machine to switch the punch and/or the typewriter on and off, enabling extracts to be made of predetermined parts of the information recorded. Another model, intended primarily for personalised letter writing and similar uses, was provided with two readers and could be switched from one reader to the other by codes in the tapes. It was thought that a combination of these two features, the two inputs and the ability to control the operation of the machine by codes in the tape, would result in a very powerful and flexible machine. It was already standard for a tape in the reader to be used as a “programme tape” to instruct the machine to move to the position required, for the next fill-in on a form for example, and whether to punch that block of information into the output tape. The second reader and the “reader switch” facility would enable a “programme tape” to be used also for determining what action should be taken on each block of information bounded by two “reader switch” codes without this action being predetermined before the information was first recorded. Not only can this “programme tape” determine whether the information is to be typed, punched, or ignored, but even the arrangement of the blocks of information on the page can be changed. For example, insertion of carriage return codes from the programme would arrange blocks of information one below the other, whereas previously they had followed one another on the same line.
Insertion of the correct notation would be ensured by preparing and checking in advance a master set of unit cards into which was “strip-punched”⁴ both the plain language and the notational equivalent of all the subject headings to be used. To make an entry, the prepunched card from the master set corresponding to the correct subject heading would be chosen and read in the first reader of the Flexowriter, which would be controlled in Duplex working by a programme tape in the second reader. The cards from the master set would be refiled immediately after use so as to be available for re-use when required. This procedure would ensure: (1) that only approved terms were used, since unit cards are added to the master set only for approved subject headings; (2) that the plain language entry would be accompanied by the correct notational equivalent. If therefore a typescript and a record tape were made simultaneously, and the typescript were proofread to ensure that the correct plain language headings had been entered, the notation in the record tape must be correct. At a later stage, after any necessary further editing, a run of the record tape through the machine under control of an appropriate programme tape would result in a tape containing only the code entries of the notation. There was evidence that such a system could achieve good reliability.
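The logic of the master set can be sketched in a few lines, purely as an illustration. The headings, the notational equivalents, and the function names below are all hypothetical; the sketch only shows why proofreading the plain language is enough to guarantee the notation when both come from the same prepunched card.

```python
# Hypothetical master set: one "card" per approved heading, carrying both
# the plain-language heading and its (invented) notational equivalent.
MASTER_SET = {
    "palatability": "Q7",
    "cultivation": "R2",
}


def make_entry(heading: str) -> tuple[str, str]:
    """Return (plain language, notation) for an approved heading.

    An unapproved heading has no card in the master set and is rejected.
    """
    try:
        return heading, MASTER_SET[heading]
    except KeyError:
        raise ValueError(f"no master card for heading: {heading!r}") from None


def code_only_tape(record_tape: list[tuple[str, str]]) -> list[str]:
    """Later pass: keep only the notation from each (plain, notation) pair."""
    return [notation for _plain, notation in record_tape]


record_tape = [make_entry("palatability"), make_entry("cultivation")]
typescript = [plain for plain, _notation in record_tape]  # what is proofread
```

Because each pair is copied from one card, a typescript that reads correctly implies that the notation on the record tape is correct as well.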
A machine was made to this specification, but on trial it was found to have minor shortcomings which diminished its usefulness. For example it was found that in “non-print” the machine would respond only to a “print restore” code, with the result that it was impossible to type an extract of a record tape under the control of a programme tape since the “reader switch” codes (which switched the input from one reader to the other) were ignored. Another difficulty was that the “reader switch” codes were punched from both readers so that the number in the output tape was doubled with each pass through the Flexowriter under the control of a “programme tape” in the other reader. This made it impossible to use a “programme tape” to control the machine unless it was known beforehand how many times the information had already been through the machine. These and other minor points have been rectified, and a trial is in progress on the lines indicated.
One other modification to the standard machine deserves mention. The standard codes for the paper tape are 5-unit, 6-unit, and 8-unit. The 5-unit code does not contain enough combinations. In the 6-unit tape, the upper and lower case characters differ only in that the former are preceded (at any distance back along the tape) by an upper case code, while a lower case code precedes the latter. In the standard Flexowriter 8-unit code, which was designed for ease of conversion of information on tape into IBM punched cards, the 6-unit code is used with an added “parity” punch in the 5th channel so that all the codes have an odd number of holes; the 8th channel is used only for the “carriage return” code. These codes were modified so that there should be a difference between upper and lower case characters in the combination being read by the reader. A punch in the 8th channel was added to all upper case characters, while this channel was blank for the lower case. This gave an 8-unit code, one channel of which was used only for parity check. The parity channel can be omitted later and a 7-unit code results. The number of non-zero combinations (127) is enough to provide a separate code for each character on the Flexowriter keyboard and one for each control signal. If the control codes were eliminated, there would be accommodation for a further alphabet which could be represented in typescript by accented letters.
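The modified 8-unit code can be sketched as follows. This is an illustration under stated assumptions, not the actual Flexowriter code table: the way the six base-code bits are spread over channels 1 to 4 and 6 to 7, and the assumption that the parity punch is computed over all the other channels including the case channel, are guesses made for the example. Only the use of channel 5 for odd parity and channel 8 for upper case is taken from the text.

```python
PARITY_BIT = 1 << 4  # channel 5 (channels numbered 1..8, bit 0 = channel 1)
CASE_BIT = 1 << 7    # channel 8, punched for upper case


def holes(code: int) -> int:
    """Number of punched holes in a code."""
    return bin(code).count("1")


def encode(base6: int, upper: bool) -> int:
    """Build an 8-unit code from a 6-unit base code plus a case flag."""
    if not 0 <= base6 < 64:
        raise ValueError("base code must fit in 6 units")
    low = base6 & 0b1111         # assumed to occupy channels 1-4
    high = (base6 >> 4) & 0b11   # assumed to occupy channels 6-7
    code = low | (high << 5)
    if upper:
        code |= CASE_BIT
    if holes(code) % 2 == 0:     # add parity punch to make the total odd
        code |= PARITY_BIT
    return code


def strip_parity(code8: int) -> int:
    """Check odd parity, then drop channel 5 to leave a 7-unit code."""
    if holes(code8) % 2 != 1:
        raise ValueError("parity error: even number of holes")
    low = code8 & 0b1111
    high = code8 >> 5            # channels 6-8 move down to close the gap
    return low | (high << 4)


seven_unit_combinations = 2 ** 7 - 1  # non-zero 7-unit codes
```

In this layout the resulting 7-unit code is simply the 6-unit base code with the case flag as its seventh unit, so upper and lower case characters differ in the combination itself rather than in a preceding shift code.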
The products of the above procedure are: (1) an output tape in which is punched the notational equivalent of all the indexing entries to be made for a particular paper, together with (2) a typescript on which is typed the plain language of all these entries. These are to be passed to the Filmorex for use as follows: the tape to produce the perforated mask from which the code pattern is photographed; and the typescript, together with a suitable abstract of the paper, to be photographed as the pictorial portion of the Filmorex “fiche.”
A special conversion unit is required to link the Flexowriter and the Filmorex. This unit reads the codes in the Flexowriter output tape, recodes them as appropriate, and punches the new codes into the Filmorex perforated mask. By varying the connections in this unit, the coupling can be made flexible. The Filmorex selector operates by passing the cards in turn through a beam of light in which is also placed a search specification card bearing the inverse of the pattern sought; the momentary “black out” of all light which occurs when a wanted card is read operates the selection shutter by means of a photocell. This principle imposes a limitation on the codes which can be used. Each pattern (in the standard Filmorex, one line of the coding area) read by a single photocell must have the same number of black spots (and of white spaces). With this limitation, the coding area, 30 units wide, can be divided into 6 fields each 5 units wide, allowing a 6-digit number to be represented (2 punches out of 5 give 10 characters). The possible vocabulary size is 10⁶ words. Alternatively a larger vocabulary (3.2×10⁶ words) can be used by dividing the line into 5 fields of 6 units each (punching 3 out of 6 gives 20 characters and 20⁵ possible “words”). Since the output of the Flexowriter is 7-unit alphanumerical, it was thought preferable to divide the line into 4 digits of 7 units each, allowing 35 characters to be represented (punching 3 out of 7). The vocabulary is reduced somewhat below the maximum (to approximately 1.5×10⁶) but greater flexibility is achieved. The 35 characters chosen are the numerals plus an alphabet, omitting I as likely to conflict with 1.
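The vocabulary figures quoted for the three ways of dividing the 30-unit line can be checked with a short calculation. The function and dictionary names are illustrative; the field widths and punch counts are those given in the text.

```python
from math import comb


def vocabulary(units_per_field: int, punches: int, fields: int) -> tuple[int, int]:
    """Return (characters per field, number of possible "words").

    Each field must carry exactly `punches` holes among `units_per_field`
    positions, so the number of characters is a binomial coefficient.
    """
    characters = comb(units_per_field, punches)
    return characters, characters ** fields


layouts = {
    "6 fields of 5 units, 2 punched": vocabulary(5, 2, 6),
    "5 fields of 6 units, 3 punched": vocabulary(6, 3, 5),
    "4 fields of 7 units, 3 punched": vocabulary(7, 3, 4),
}

for name, (characters, words) in layouts.items():
    print(f"{name}: {characters} characters, {words:,} words")
```

The third layout uses only 28 of the 30 units, which gives the smaller vocabulary of 35⁴ = 1,500,625 words.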
As the equipment stands, the system is simple and flexible. The codes are read in order on successive lines of the fiche, and the extensive presorting which the small, cheap fiche makes economic speeds up the search by making it unnecessary in the majority of cases to search more than a small fraction of the file. The present selector has 5 reading heads and can be set to select various logical combinations of 5 code lines at one pass. It cannot distinguish the order in which these code lines occur.
If, however, the reading mechanism of the Filmorex were altered by the addition of further photocells, it would be possible to remove the restriction mentioned above on the code combinations which can be used, and the full theoretical total of 127 non-zero combinations possible for a 7-unit code could be used to accommodate 127 different characters or control signals. The versatility of the selector could be further increased by adding more logical circuits. By using the two spare units to enable the number of lines to be counted, this logical circuitry could distinguish the order of codes.
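Why a single photocell requires every pattern to have the same number of spots can be seen from a small model of the “black out” principle. The model is an assumption made for illustration: a code line is represented as the set of positions that let light through, and the search card lets light through exactly where the wanted pattern does not. It is not a description of the actual Filmorex optics.

```python
from itertools import combinations

UNITS = 7  # width of one character field in the layout chosen in the text


def light_passes(card: frozenset, wanted: frozenset) -> bool:
    """Light reaches the photocell where both the card and the search card
    (the inverse of the wanted pattern) are transparent."""
    search_card = frozenset(range(UNITS)) - wanted
    return bool(card & search_card)


def selected(card: frozenset, wanted: frozenset) -> bool:
    """A card is selected when the photocell sees a complete black out."""
    return not light_passes(card, wanted)


# All 3-out-of-7 patterns: the constant-weight code used for the 35 characters.
patterns = [frozenset(c) for c in combinations(range(UNITS), 3)]

# With constant weight, black out occurs only for the identical pattern.
exact_only = all(selected(a, b) == (a == b) for a in patterns for b in patterns)

# Without the restriction, a pattern with fewer transparent spots that lies
# inside the wanted pattern also blacks out the light: a false selection.
false_drop = selected(frozenset({0, 1}), frozenset({0, 1, 2}))
```

In this model a black out means only that the card's transparent spots all fall within the wanted pattern; fixing the number of spots turns that containment into equality, which is the restriction the additional photocells would remove.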
In order to allow a trial of the collection to be begun, it was decided to accept the present limitations of the Flexowriter and Filmorex and to try them in combination before attempting any further modifications, such as those outlined above. These can be worked on while the trial is in progress. For work to commence, botanical and chemical notations must be worked out which do not need more than 35 characters.
For the botanical entries, a system has been worked out. Zero was reserved for generality, and it was found that the whole of the plant kingdom could be accommodated (Table 1). For the flowering plants a more detailed classification has been made by using Willis’s system (i) as a basis. Nine alphanumeric digits are used. Of these the first three designate the family (Table 2), the next two the genus, and the remaining four the species and variety. So far some 5,000 species of flowering plants have been satisfactorily coded. For ease in recognition, a space is left after the third digit (the family) and a decimal point is inserted after the fifth (the genus), e.g., 365 . (Zeros are barred to distinguish them from the letter O.)
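The layout of the nine-digit botanical code can be sketched as below. The function names are hypothetical, and the sample value combines the family code 573 (Rosaceae in Table 2) with an invented genus and species, since no complete example survives in the text.

```python
def format_plant_code(raw: str) -> str:
    """Display a nine-digit code as family, genus, and species/variety:
    a space after the third digit and a decimal point after the fifth."""
    if len(raw) != 9 or not raw.isalnum():
        raise ValueError("expected nine alphanumeric digits")
    return f"{raw[:3]} {raw[3:5]}.{raw[5:]}"


def split_plant_code(shown: str) -> tuple[str, str, str]:
    """Recover (family, genus, species and variety) from the displayed form."""
    raw = shown.replace(" ", "").replace(".", "")
    if len(raw) != 9 or not raw.isalnum():
        raise ValueError("expected nine alphanumeric digits")
    return raw[:3], raw[3:5], raw[5:]


def same_family(a: str, b: str) -> bool:
    """Generic test at family level: compare the first three digits only."""
    return split_plant_code(a)[0] == split_plant_code(b)[0]


# Family 573 is from Table 2; genus "12" and species "0034" are invented.
shown = format_plant_code("573120034")
```

Because the family occupies the leading digits, a generic search at family level reduces to comparing a fixed-length prefix.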
TABLE 1. Plant classification: major groups
Group | Code
CRYPTOGAMIA |
Thallophyta |
Bacteria | A
Myxomycetes | B
Algae |
Chlorophyceae | C-D
Xanthophyceae | E
Bacillariophyceae | F
Euglenineae | G
Phaeophyceae | H
Rhodophyceae | I
Cyanophyceae | J
Fungi |
Phycomycetes |
Oomycetes | K
Zygomycetes | L
Ascomycetes |
Endomycetales | M-N
(Yeasts as such) | N
Plectomycetes | O
Discomycetes | P
Pyrenomycetes | Q
Basidiomycetes |
Ustilaginales | R
Uredinales | S
Hymenomycetes | T
Gasteromycetes | U
Fungi imperfecti | V
Lichenes |
Ascolichenes | W
Bryophyta |
Hepaticae | X
Muscineae | Y
Pteridophyta | Z
PHANEROGAMIA |
Spermaphyta |
Gymnospermae | 110–140
Angiospermae |
Monocotyledonae | 170–200
Dicotyledonae |
Archichlamydeae | 300–700
Sympetalae | 860–900
For the chemical compounds, several notations have been proposed which can be expressed within the limits of 35 characters. The Wiswesser system (cc, dd, ee) was originally adapted to the slightly greater range of characters which a punched card machine can handle but has been developed with more characters into a notation such as was envisaged by IUPAC (z) and is designed for correlation and searching procedures. The Chemical-Biological Coordination Center developed a code (ff) for use with its work which is of a completely different character. In this the various component groupings are enumerated and no attempt is made to designate the complete compound with a unique cipher. The Centre National de la Recherche Scientifique has developed a
TABLE 2. Families of flowering plants and ferns (Willis)
Pteridophyta |
|
Cyatheaceae |
Z12 |
Equisetaceae |
Z31 |
Gleicheniaceae |
Z16 |
Hymenophyllaceae |
Z11 |
Isoetes |
Z99 |
Ligulatae |
Z50 |
Lycopodiaceae |
Z41 |
Marattiaceae |
Z23 |
Marsiliaceae |
Z21 |
Matoniaceae |
Z15 |
Ophioglossaceae |
Z24 |
Osmundaceae |
Z18 |
Parkeriaceae |
Z14 |
Polypodiaceae |
Z13 |
Psilotaceae |
Z71 |
Salviniaceae |
Z22 |
Schizaeaceae |
Z17 |
Gymnospermae |
|
Cycadaceae |
111 |
Ginkgoaceae |
121 |
Gnetaceae |
141 |
Pinaceae |
132 |
Taxaceae |
131 |
Monocotyledons |
|
Alismaceae |
185 |
Amaryllidaceae |
275 |
Aponogetonaceae |
183 |
Araceae |
241 |
Bromeliaceae |
259 |
Burmanniaceae |
291 |
Butomaceae |
186 |
Cannaceae |
283 |
Centrolepidaceae |
253 |
Commelinaceae |
261 |
Cyanastraceae |
263 |
Cyclanthaceae |
231 |
Cyperaceae |
212 |
Dioscoreaceae |
278 |
Eriocaulaceae |
256 |
Flagellariaceae |
251 |
Gramineae |
211 |
Haemodoraceae |
274 |
Hydrocharitaceae |
187 |
Iridaceae |
279 |
Juncaceae |
271 |
Lemnaceae |
242 |
Liliaceae |
273 |
Marantaceae |
284 |
Mayacaceae |
254 |
Musaceae |
281 |
Najadaceae |
182 |
Orchidaceae |
292 |
Palmae |
221 |
Pandanaceae |
172 |
Philydraceae |
264 |
Pontederiaceae |
262 |
Potamogetonaceae |
181 |
Rapateaceae |
258 |
Restionaceae |
252 |
Scheuchzeriaceae |
184 |
Sparganiaceae |
173 |
Stemonaceae |
272 |
Taccaceae |
277 |
Thurniaceae |
257 |
Triuridaceae |
191 |
Typhaceae |
171 |
Velloziaceae |
276 |
Xyridaceae |
255 |
Zingiberaceae |
282 |
Dicotyledons |
|
Acanthaceae |
951 |
Aceraceae |
657 |
Achariaceae |
741 |
Achatocarpaceae |
487 |
Actinidiaceae |
712 |
Adoxaceae |
974 |
Aextoxicaceae |
664 |
Aizoaceae |
488 |
Akaniaceae |
663 |
Alangiaceae |
778 |
Amarantaceae |
482 |
Anacardiaceae |
645 |
Ancistrocladaceae |
746 |
Anonaceae |
524 |
Apocynaceae |
915 |
Aquifoliaceae |
649 |
Araliaceae |
791 |
Aristolochiaceae |
461 |
Asclepiadaceae |
916 |
Balanophoraceae |
458 |
Balanopsidaceae |
361 |
Balsaminaceae |
667 |
Basellaceae |
492 |
Batidaceae |
391 |
Begoniaceae |
745 |
Berberidaceae |
518 |
Betulaceae |
421 |
Bignoniaceae |
934 |
Bixaceae |
729 |
Bombacaceae |
686 |
Boraginaceae |
924 |
Bretschneideraceae |
659 |
Brunelliaceae |
559 |
Bruniaceae |
563 |
Brunoniaceae |
993 |
Burseraceae |
622 |
Buxaceae |
641 |
Byblidaceae |
558 |
Cactaceae |
751 |
Callitrichaceae |
639 |
Calycanthaceae |
522 |
Calyceraceae |
995 |
Campanulaceae |
991 |
Canellaceae |
732 |
Capparidaceae |
532 |
Caprifoliaceae |
973 |
Caricaceae |
742 |
Caryocaraceae |
717 |
Caryophyllaceae |
494 |
Casuarinaceae |
311 |
Celastraceae |
651 |
Cephalotaceae |
555 |
Ceratophyllaceae |
512 |
Cercidiphyllaceae |
515 |
Chenopodiaceae |
481 |
Chlaenaceae |
682 |
Chloranthaceae |
323 |
Cistaceae |
728 |
Clethraceae |
861 |
Cneoraceae |
618 |
Cochlospermaceae |
731 |
Columelliaceae |
939 |
Combretaceae |
779 |
Compositae |
996 |
Connaraceae |
574 |
Convolvulaceae |
921 |
Coriariaceae |
643 |
Cornaceae |
793 |
Corynocarpaceae | 648 |
Crassulaceae | 554 |
Crossosomataceae | 572 |
Cruciferae | 533 |
Crypteroniaceae | 772 |
Cucurbitaceae | 981 |
Cunoniaceae | 561 |
Cynocrambaceae | 484 |
Cynomoriaceae | 787 |
Cyrillaceae | 646 |
Daphniphyllaceae | 638 |
Datiscaceae | 744 |
Desfontainiaceae | 913 |
Diapensiaceae | 866 |
Dichapetalaceae | 636 |
Diclidantheraceae | 896 |
Didieraceae | 661 |
Dilleniaceae | 711 |
Dipsacaceae | 976 |
Dipterocarpaceae | 723 |
Droseraceae | 543 |
Dysphaniaceae | 493 |
Ebenaceae | 892 |
Elaeagnaceae | 765 |
Elaeocarpaceae | 681 |
Elatinaceae | 724 |
Empetraceae | 642 |
Epacridaceae | 865 |
Ericaceae | 864 |
Erythroxylaceae | 616 |
Eucommiaceae | 565 |
Eucryphiaceae | 713 |
Euphorbiaceae | 637 |
Eupomatiaceae | 525 |
Fagaceae | 422 |
Flacourtiaceae | 735 |
Fouquieraceae | 727 |
Frankeniaceae | 725 |
Garryaceae | 341 |
Geissolomataceae | 761 |
Gentianaceae | 914 |
Geraniaceae | 611 |
Gesneriaceae | 938 |
Globulariaceae | 942 |
Gomortegaceae | 527 |
Gonystilaceae | 683 |
Goodeniaceae | 992 |
Grubbiaceae | 454 |
Guttiferae | 722 |
Gyrostemonaceae | 486 |
Haloragidaceae | 785 |
Hamamelidaceae | 564 |
Hernandiaceae | 52x |
Heteropyxidaceae | 766 |
Himantandraceae | 513 |
Hippocastanaceae | 658 |
Hippocrateaceae | 652 |
Hippuridaceae | 786 |
Hoplestigmataceae | 734 |
Humiriaceae | 615 |
Hydnoraceae | 463 |
Hydrocaryaceae | 783 |
Hydrophyllaceae | 923 |
Hydrostachyaceae | 553 |
Icacinaceae | 656 |
Juglandaceae | 381 |
Julianaceae | 411 |
Labiatae | 926 |
Lacistemaceae | 324 |
Lactoridaceae | 523 |
Lardizabalaceae | 517 |
Lauraceae | 529 |
Lecythidaceae | 775 |
Leguminosae | 575 |
Leitneriaceae | 371 |
Lennoaceae | 863 |
Lentibulariaceae | 941 |
Limnanthaceae | 644 |
Linaceae | 614 |
Lissocarpaceae | 895 |
Loasaceae | 743 |
Loganiaceae | 912 |
Loranthaceae | 457 |
Lythraceae | 771 |
Magnoliaceae | 521 |
Malesherbiaceae | 738 |
Malpighiaceae | 631 |
Malvaceae | 685 |
Marcgraviaceae | 718 |
Martyniaceae | 936 |
Medusagynaceae | 716 |
Melastomataceae | 782 |
Meliaceae | 623 |
Melianthaceae | 666 |
Menispermaceae | 518 |
Monimiaceae | 528 |
Moraceae | 432 |
Moringaceae | 536 |
Myoporaceae | 953 |
Myricaceae | 351 |
Myristicaceae | 526 |
Myrothamnaceae | 562 |
Myrsinaceae | 872 |
Myrtaceae | 781 |
Myzodendraceae | 451 |
Nepenthaceae | 542 |
Nolanaceae | 931 |
Nyctaginaceae | 483 |
Nymphaeaceae | 511 |
Nyssaceae | 777 |
Ochnaceae | 714 |
Octoknemataceae | 456 |
Olacaceae | 455 |
Oleaceae | 911 |
Oliniaceae | 763 |
Onagraceae | 784 |
Opiliaceae | 453 |
Orobanchaceae | 937 |
Oxalidaceae | 612 |
Pandaceae | 581 |
Papaveraceae | 531 |
Passifloraceae | 739 |
Pedaliaceae | 935 |
Penaeaceae | 762 |
Pentaphylacaceae | 647 |
Phrymaceae | 954 |
Phytolaccaceae | 485 |
Piperaceae | 322 |
Pittosporaceae | 557 |
Plantaginaceae | 961 |
Platanaceae | 571 |
Plumbaginaceae | 881 |
Podostemaceae | 551 |
Polemoniaceae | 922 |
Polygalaceae | 635 |
Polygonaceae | 471 |
Portulacaceae | 491 |
Primulaceae | 873 |
Proteaceae | 441 |
Punicaceae | 774 |
Pyrolaceae | 862 |
Quiinaceae | 719 |
Rafflesiaceae | 462 |
Ranunculaceae | 516 |
Resedaceae | 535 |
Rhamnaceae | 671 |
Rhizophoraceae | 776 |
Rosaceae | 573 |
Rubiaceae | 971 |
Rutaceae | 619 |
Sabiaceae | 665 |
Salicaceae | 331 |
Salvadoraceae | 653 |
Santalaceae | 452 |
Sapindaceae | 662 |
Sapotaceae | 891 |
Sarraceniaceae | 541 |
Saururaceae | 321 |
Saxifragaceae | 556 |
Scrophulariaceae | 933 |
Scytopetalaceae | 688 |
Simarubaceae | 621 |
Solanaceae | 932 |
Sonneratiaceae | 773 |
Stachyuraceae | 736 |
Stackhousiaceae | 654 |
Staphyleaceae | 655 |
Sterculiaceae | 687 |
Strasburgeriaceae | 715 |
Stylidiaceae | 994 |
Styracaceae | 894 |
Symplocaceae | 893 |
Tamaricaceae | 726 |
Theaceae | 721 |
Theophrastaceae | 871 |
Thymelaeaceae | 764 |
Tiliaceae | 684 |
Tovariaceae | 534 |
Tremandraceae | 634 |
Trigoniaceae | 632 |
Tristichaceae | 552 |
Trochodendraceae | 514 |
Tropaeolaceae | 613 |
Turneraceae | 737 |
Ulmaceae | 431 |
Umbelliferae | 792 |
Urticaceae | 433 |
Valerianaceae | 975 |
Verbenaceae | 925 |
Violaceae | 733 |
Vitaceae | 672 |
Vochysiaceae | 633 |
Zygophyllaceae | 617 |
system on somewhat the same lines as the CBCC for use in its Filmorex installation (gg) and it is proposed to adopt this for the first trial run.
Other properties, e.g., palatability, medicinal effects, texture, conditions of growth, susceptibility to diseases, are of importance in determining the economic value and use which can be made of plants. These properties are difficult in many cases to describe on a numerical or other linear scale. The number of possible headings for each is small, however, and coding in a maximum of two digits is possible in a number of ways, within the range of 35 characters.
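As a rough illustration of the two-digit coding mentioned above, the following sketch maps a heading number to a pair of symbols drawn from a 35-character range. The paper does not list the 35 characters, so the choice of the digits 1 to 9 plus the capital letters is an assumption made purely for the example, as are the function names.

```python
import string

# Assumption: the 35 characters are not specified in the paper;
# digits 1-9 plus A-Z (9 + 26 = 35) are used here for illustration only.
SYMBOLS = "123456789" + string.ascii_uppercase
BASE = len(SYMBOLS)          # 35 symbols per position
CAPACITY = BASE ** 2         # number of headings two symbols can distinguish


def encode_heading(index: int) -> str:
    """Map a 0-based heading number to a two-symbol code (hypothetical scheme)."""
    if not 0 <= index < CAPACITY:
        raise ValueError("heading number does not fit in two symbols")
    high, low = divmod(index, BASE)
    return SYMBOLS[high] + SYMBOLS[low]


def decode_heading(code: str) -> int:
    """Recover the heading number from a two-symbol code."""
    if len(code) != 2:
        raise ValueError("code must be exactly two symbols")
    return SYMBOLS.index(code[0]) * BASE + SYMBOLS.index(code[1])
```

With 35 symbols in each of two positions, 35 x 35 = 1225 headings can be distinguished, which is ample if the number of headings per property is small.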
At the time when this paper was proposed, it was thought that all the above experimental work could be reported on. Owing to delays in delivery of equipment and, in particular, to a serious accident which befell the author, it has not proved possible to include the results. The present status of the work is: the Flexowriter has been tried on all the procedures, and, subject to the modifications outlined, has proved itself satisfactory; the detailed design of the conversion unit has been completed and construction is due to begin. The delivery of the Filmorex equipment is expected in the near future. By autumn 1958 some results should be available and these will be reported in due course.
Summary
Some serious limitations of existing methods of indexing and cataloguing scientific information became apparent when the possibility was being explored of setting up a large detailed system which would answer enquiries on the published information on the chemical compounds found in plants. It was estimated that this system would need to contain of the order of 10^7 items, each of which might be sought from any one of four aspects, namely chemical, botanical, functional, and miscellaneous. By functional is meant such characters as palatability, pharmaceutical effects, and toxicity, while the miscellaneous aspect includes such factors as cultivation and place of growth. The problem was complicated by a need for retrieval when given partial specifications, for example plants containing chemical compounds which have certain groupings in common. Such a system cannot be handled by any of the existing classical methods.
The solution proposed, which is applicable to any system of large size which has to be indexed in detail, is to express each of the factors in a notation in which a linear series of symbols expresses the factor element by element and in which the relationship between the elements is expressed partly by the order of these symbols and partly by special symbols of relationship. In the example mentioned above, notations for expressing the chemical aspect have been proposed, the botanical aspect has been extensively classified by taxonomic characters although no notation of this type has been formulated, while the functional aspect is not classified.
Use of such a notation on the scale envisaged presents problems of machine design. Although some machines work satisfactorily on a binary system, a range of symbols expressed in binary form is not convenient for manual handling (e.g., compilation of codes and entry of the information into the system). For this, use of the maximum number of symbols is desirable. A compromise has been adopted with a range of symbols as large as can be accommodated on a typewriter keyboard, that is to say, two alphabets, two ranges of numerals, and a full set of punctuation marks, which can be converted by a modified model of a tape punching typewriter into a seven-bit code.
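The capacity argument can be checked with a short sketch. Seven bits give 2^7 = 128 combinations, enough for two alphabets, two ranges of numerals, and a set of punctuation marks. The actual code assignment used on the modified typewriter is not given in the paper, so the ordering below, the use of subscript digits as a stand-in for the second range of numerals, and the function names are all assumptions.

```python
import string

# Stand-in for the "second range of numerals" (assumption): subscript digits.
SECOND_NUMERALS = "".join(chr(0x2080 + d) for d in range(10))

# Two alphabets, two ranges of numerals, and punctuation marks.
SYMBOLS = (
    string.ascii_uppercase
    + string.ascii_lowercase
    + string.digits
    + SECOND_NUMERALS
    + string.punctuation
)

# Every symbol must be distinct and the set must fit in seven bits.
assert len(set(SYMBOLS)) == len(SYMBOLS)
assert len(SYMBOLS) <= 2 ** 7

# Hypothetical assignment: consecutive seven-bit patterns in list order.
ENCODE = {symbol: format(i, "07b") for i, symbol in enumerate(SYMBOLS)}
DECODE = {bits: symbol for symbol, bits in ENCODE.items()}


def to_tape(text: str) -> list[str]:
    """Convert typed text to one seven-bit row per symbol."""
    return [ENCODE[ch] for ch in text]


def from_tape(rows: list[str]) -> str:
    """Convert seven-bit rows back to text."""
    return "".join(DECODE[row] for row in rows)
```

Under these assumptions the symbol set has 104 members, leaving 24 of the 128 combinations free for control functions.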
Once the information has been recorded in this way and checked for accuracy, further handling can be done by machine and further checking should be unnecessary. Various devices can be used to facilitate this initial checking.
Selection again presents a problem. The standard punched card machine, which immediately comes to mind as a possible means of mechanical selection, is unsuited to seven-bit codes and even more unsuited to notations of the type proposed, in which the position of the group to be searched for cannot be specified. On the other hand, electronic computers, which are ideal for handling binary information, have memories which are several orders of magnitude too small and have the additional disadvantage of being unnecessarily expensive.
The selection can be done by a rapid selector type of equipment modified so as to accept punched tape (the output from a tape punching typewriter) as its input. In this machine the indexing entries recorded in binary code are scanned term by term so that the position of the group for which search is being made is immaterial.
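The idea of a scan in which the position of the sought group is immaterial can be sketched in software as follows. This is only an analogy to the optical selector, not a description of it, and the entry keys and notation strings are invented for the example.

```python
def scan_entry(notation: str, group: str) -> bool:
    """Step through the entry one symbol at a time and report whether
    the sought group of symbols occurs at any position."""
    width = len(group)
    for start in range(len(notation) - width + 1):
        if notation[start:start + width] == group:
            return True
    return False


def select(entries: list[tuple[str, str]], group: str) -> list[str]:
    """Return the keys of all entries whose notation contains the group."""
    return [key for key, notation in entries if scan_entry(notation, group)]


# Hypothetical index entries: (key, linear notation). The notations are
# made up and do not follow any of the published chemical notations.
EXAMPLE_ENTRIES = [
    ("entry-1", "C6.OH.CO2H"),
    ("entry-2", "CO2H.C6"),
    ("entry-3", "N.C5"),
]
```

Here `select(EXAMPLE_ENTRIES, "CO2H")` picks out the first two entries although the group stands at a different position in each, which is the property a fixed-field punched card sort lacks.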
There are three points of novelty in the process. First is the development of suitable notations and their adaptation to machine limitations and use. Second is the modification to the tape punching typewriter to give it the necessary versatility and flexibility. Third is the adaptation of the Filmorex to accept codes of this type and the design and construction of the converters needed to transfer the information from one machine to the next.
REFERENCES
(a) Survey of Agricultural, Forestry and Fishery Products in the United Kingdom. Development Commission, London, 1953.
(b) KLEIN, G. Handbuch der Pflanzenanalyse. Springer, Wien, 1931–2.
(c) WEHMER, C. Die Pflanzenstoffe. Verlag Fischer, Jena, 1929–30.
(d) WALL, E.M., FENSKE, C.S., WILLAMAN, J.J., CORRELL, D.S., SCHUBERT, B.G., and GENTRY, H.S. Steroidal sapogenins XXVI. Supplementary table of data for steroidal sapogenins XXV. September 1955. U.S. Dept. Agr., Agric. Research Service, ARS-73–4.
(e) URQUHART, D.J. Unanswered Questions No. 5. Dept. Scientific Industrial Research, London, October 1951, pp. 1–3.
(f) BENTHAM, G. and HOOKER, J.D. Genera Plantarum, London, 1862.
(g) ENGLER, A., and PRANTL, K. Die Natürlichen Pflanzenfamilien, Leipzig.
(h) HUTCHINSON, J. Families of Flowering Plants: Vol. I, Dicotyledons; Vol. II, Monocotyledons. Macmillan, London, 1926/34.
(i) WILLIS, J.C. Flowering Plants and Ferns, 6th revised edition. Cambridge University Press, 1951.
(j) MULLINS, L.J., and NICKERSON, W.J. A proposal for serial number identification of biological species. Chronica botan., 12, 4 (1951).
(k) GOULD, S.W. Permanent numbers to supplement the binomial system of nomenclature. Am. Scientist, 42, 269–74 (1954).
(l) SPORNE, K.R. Statistics and the evolution of dicotyledons, Evolution, 8 [1], 55–64 (1954).
(m) SPORNE, K.R. The phylogenetic classification of the Angiosperms. Biol. Revs., 31, 1–29 (1956).
(n) DYSON, G.M. Studies in chemical documentation III. Mechanized documentation. Chem. & Ind., 1954 (April 17), 400–9.
(o) FARRADANE, J.E.L. A scientific theory of classification and indexing and its practical applications. J. Document., 6, 83–99 (1950); A scientific theory of classification and indexing: further considerations. J. Document., 8(2), 73–82 (1952).
(p) RANGANATHAN, S.R. Annals of Library Science.
(q) KING, G.W. A new approach to information storage. Control Eng., August 1955.
KING, G.W., BROWN, G.W., and RIDENOUR, L.N. Photographic techniques for information storage. Proc. I.R.E., 41(10), 421–8 (1953).
ANON. Photoscopic information storage. International Telemeter Corporation, Los Angeles, Calif., Publication R-77, March 15, 1955.
(r) TYLER, A.W., MYERS, W.L., and KUIPERS, J.W. The application of the Kodak Minicard system to problems of documentation. Am. Document., 6(1), 18–30 (1955).
KUIPERS, J.W., TYLER, A.W., and MYERS, W.L. A Minicard system for documentary information. Am. Document., 8(4), 246–68 (1957).
(s) Minicard demonstration. Am. Document., 6(4), 258–9 (1955).
(t) Report for the microfilm rapid selector. Eng. Research Assoc., Inc., No. PB 97313. U.S. Dept. Commerce, 1949.
(u) SHAW, R.R. The rapid selector. J. Document., 5, 164–71 (1949).
(v) The Western Reserve Searching Selector. Am. Document., 8(3), 237–8 (1957).
(w) SAMAIN, J. Progrès du classement et de la sélection mécanique des documents. 17 Conf. FID Berne, August 1947, pp. 22–26.
(x) SAMAIN, J. The organization of documentation by the Filmorex technique. Filmorex, Paris, 1956.
(y) BROWN, R. HUNT, Editor, Office Automation. Automation Consultants, Inc., New York, 1955, pp. 51–7.
(z) Codes invited. Chem. Eng. News, 27, 2998 (1949).
(aa) DYSON, G.M. A New Notation and Enumeration System for Organic Compounds, 2nd edition. Longmans Green, London, 1949.
(bb) DYSON, G.M. Private communication.
(cc) WISWESSER, W.J. Simplified chemical coding for automatic sorting and printing machinery. Willson Products Inc., Reading, Pa., 1951.
(dd) WISWESSER, W.J. The Wiswesser line formula notation. Chem. Eng. News, 30, 3525–6 (1952).
(ee) WISWESSER, W.J. A Line Formula Chemical Notation. Crowell Co., New York, 1954.
(ff) CHEMICAL-BIOLOGICAL COORDINATION CENTER. A method for coding chemicals for correlation and classification. National Research Council, Washington, D.C. 1950.
(gg) SAMAIN, J. Personal communication.