Suggested Citation:"Subject-Word Letter Frequencies with Applications to Superimposed Coding." National Research Council. 1959. Proceedings of the International Conference on Scientific Information: Two Volumes. Washington, DC: The National Academies Press. doi: 10.17226/10866.

Subject-Word Letter Frequencies with Applications to Superimposed Coding

HERBERT OHLMAN

ABSTRACT. The frequencies of occurrence of English letters in the first five positions of subject words and proper names are determined. With these frequencies a superimposed code is designed. No code book is required. Coding space is utilized almost as economically as with a random code. An empirical check is made. A quantitative measure of word popularity is proposed using letter-frequency data.

Coding, or the transforming of information from one guise to another, is one of man’s commonest activities. Every picture may be said to be a coding of some real scene and every written word a coding of some utterance—the brain itself is said to work with coded impulses.

Since the beginning of mass communications, starting with the invention of printing and increasing with the widespread use of electronics, efficient use of existing space and time has become more and more important. Today, information theory provides a sound basis for determining the limits of transmission speed and accuracy. However, Shannon’s theory (1) does not tell us how to make a particular code more efficient. The design of codes is still an art; this paper deals with the improvement of one particular type, superimposed coding.

In information searching, mechanical aids are being used wherever possible. For a machine to process information, the information must be coded, usually into some variant of that most basic code of all, the binary. However, the most efficient code for pure selection appears to be a superimposed random code: each coding position is used in a random manner, and a group of coding positions contains superimposed entries.

Calvin Mooers (2, 4) calls such coding “Zatocoding” and has applied it in his patented marginal-punched card system called Zator. However, Zatocoding requires an intermediate step in both coding and searching: a code book containing a number of indexing terms with random-number equivalents.

HERBERT OHLMAN System Development Corp., Santa Monica, Calif.

FIGURE 1.

Carl Wise (3, 4) has produced a nonrandom superimposed code which he calls “word coding” for use with marginal-punched cards of the Keysort variety. This type of code does not require an intermediate code book.

The author has attempted to combine the best features of both systems in coding English words by essentially pre-randomizing the alphabet. This is possible because there is a certain invariance of the letter frequencies within each letter position of a word.

As this system was developed in response to a specific need, it may be well to talk about it in concrete terms, and later apply its principles to other information systems. A marginal-punched card produced on IBM equipment (5) was used as the unit record. The thirty-eight positions along the top edge could code 38 words or phrases by using a direct code, but by superimposition every position could be made to do multiple duty. However, neither of the two systems previously described seemed to meet the requirement of a directly interpretable, yet efficient code.

Subject-word and proper-name lists were studied to find what letter frequencies occurred in the first five letter positions. Some work along these lines had been done, notably by Geisler for the ASM-SLA (6) with proper names, and by Krieger (7) with subject words (however, Krieger only considered initial letters in designing his code).

Striking similarities for initial-letter frequencies among various subject-word lists were found, as shown in Table 1. The average of five such lists shows that 40% of the words begin with C, S, P, or A (in that order). Furthermore, 85% begin either with these four or with B, M, T, R, E, F, D, G, H, or I, which together make up only 54% of the alphabet.

Even greater consistency was found with proper-name lists, as shown in Table 2, but with a different ranking of the letters. The average of three such lists gave S, B, M, H, and C for the beginning letters of 40% of the names, and these five and D, G, K, L, R, P, W, A, and F (again 54% of the alphabet) accounted for 83%.

The Library of Congress list was chosen as typical of the subject-word lists, and the 1955 Syracuse Telephone Directory as typical of names. A systematic sample was obtained from each list by recording the top-left, middle, and top-right terms from every two-page spread.¹ The frequencies of letters in each of the first five positions were then obtained for each list, as shown in Tables 3 and 4.

¹	The probabilities in this case are not independent, but every term is equidistant in the alphabetical sequence from the next term chosen, which is a sufficient approximation to true randomness for the purposes of this study.
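The tallying step described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names `systematic_sample` and `position_frequencies` are hypothetical, a fixed stride stands in for the "three terms per two-page spread" rule, and non-alphabetic characters are simply skipped, which the paper does not specify. Blanks (words too short to reach a position) are counted separately and left out of the percentages, as in the footnotes to Tables 3 and 4.

```python
from collections import Counter

def systematic_sample(terms, step):
    """Take every `step`-th term from an alphabetically ordered list."""
    return terms[::step]

def position_frequencies(words, positions=5):
    """Percentage frequency of each letter in each of the first `positions`
    letter positions. Returns (list of {letter: percent}, list of blank counts)."""
    counts = [Counter() for _ in range(positions)]
    blanks = [0] * positions
    for word in words:
        # Assumption: only alphabetic characters count as letter positions.
        letters = [ch for ch in word.upper() if ch.isalpha()]
        for i in range(positions):
            if i < len(letters):
                counts[i][letters[i]] += 1
            else:
                blanks[i] += 1  # word too short for this position
    freqs = []
    for counter in counts:
        total = sum(counter.values())
        freqs.append(
            {ltr: 100.0 * n / total for ltr, n in counter.items()} if total else {}
        )
    return freqs, blanks

freqs, blanks = position_frequencies(["AIRCRAFT", "ART", "SARI"])
print(freqs[0], blanks)
```

With a real heading list, `systematic_sample` would be applied first and its output passed to `position_frequencies`.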

TABLE 1 Subject-word initial letter frequencies^a

	Chambers Technical Dict’y, 1942, 912 pp.		Merriam-Webster Unabridged Dict’y, 2987 pp.		Industrial Arts Index (Vol. 41, No. 5), April, 1953, 787 pp.		Chem. Abstracts Decennial Subject Index 1907–16 & 1927–36 (after Krieger, ref. 7)		Lib. of Congress T.I.D. List of Subject Headings, June 1952, 327 pp.		Average Frequency, %	Rank
Letter	Freq., %	Deviation	Freq., %	Deviation	Freq., %	Deviation	Freq., %	Deviation	Freq., %	Deviation	Average Frequency, %	Rank
A	7.3	−1.0	6.6	−1.7	9.3	+1.0	8.6	+0.3	9.7	+1.4	8.3	4
B	6.2	+0.4	5.7	−0.1	4.5	−1.3	7.4	+1.6	4.9	−0.9	5.8	5
C	10.8	+0.4	9.8	−0.6	10.0	−0.4	12.6	+2.2	8.8	−1.6	10.4	1
D	5.7	+1.5	4.9	+0.7	3.3	−0.9	3.5	−0.7	3.6	−0.6	4.2	11
E	4.7	+0.3	3.3	−1.1	6.1	+1.7	3.9	−0.5	4.2	−0.2	4.4	9
F	4.6	+0.3	3.9	−0.4	3.7	−0.6	4.9	+0.6	4.5	+0.2	4.3	10
G	3.7	−0.2	3.2	−0.7	4.6	+0.7	3.7	−0.2	4.5	+0.6	3.9	12
H	4.4	+0.6	3.7	−0.1	3.3	−0.5	4.3	+0.5	3.6	−0.2	3.8	13.5
I	3.1	−0.7	3.0	−0.8	6.0	+2.2	3.7	−0.1	3.2	−0.6	3.8	13.5
J	0.6	0.0	0.9	+0.3	0.2	−0.4	~0.4	−0.2	0.7	+0.1	0.6	21.5
K	1.0	+0.4	0.9	+0.3	0.4	−0.2	~0.4	−0.2	0.3	−0.3	0.6	21.5
L	3.8	+0.7	3.2	+0.2	2.5	−0.6	3.3	+0.2	2.6	−0.5	3.1	15
M	5.6	−0.1	5.0	−0.7	6.6	+0.9	5.3	−0.4	6.2	+0.5	5.7	6
N	2.0	−0.5	1.8	−0.7	2.4	−0.1	3.1	+0.6	3.2	+0.7	2.5	16
O	2.2	0.0	2.6	+0.4	1.7	−0.5	2.7	+0.5	1.9	−0.3	2.2	18
P	9.1	−0.4	9.3	−0.2	10.3	+0.8	10.8	+1.3	8.1	−1.4	9.5	3
Q	0.6	+0.1	0.6	−0.1	0.2	−0.3	~0.7	+0.2	0.3	−0.2	0.5	23
R	4.4	−0.4	4.8	0.0	4.7	−0.1	3.5	−1.3	6.8	+2.0	4.8	8
S	10.4	+0.3	12.4	+2.3	9.9	−0.2	8.4	−1.7	10.4	+0.3	10.1	2
T	4.8	−0.5	6.3	+1.0	5.1	−0.2	4.3	−1.0	6.2	+0.9	5.3	7
U	0.8	−0.4	1.9	+0.7	0.9	−0.3	~0.7	−0.5	1.9	+0.7	1.2	20
V	1.6	+0.2	1.7	+0.3	1.1	−0.3	1.4	0.0	1.3	−0.1	1.4	19
W	1.8	−0.5	3.3	+1.0	2.4	+0.1	1.9	−0.4	1.9	−0.4	2.3	17
X	0.2	0.0	0.1	−0.1	0.2	0.0	~0.03	−0.2	0.3	+0.1	0.2	26
Y	0.2	−0.1	0.4	+0.1	0.1	−0.2	~0.03	−0.3	0.3	0.0	0.3	24.5
Z	0.4	+0.1	0.3	0.0	0.2	−0.1	~0.03	−0.3	0.3	0.0	0.3	24.5
Check sum	100.0	+0.5	99.6	0.0	99.7	+0.2	99.6	0.0	99.7	+0.2	99.5
^a Frequencies which deviate more than 1% from the average are shown in italics.

TABLE 2 Proper-name initial letter frequencies^a

	Chemical Abstracts Fourth Decennial Author Index, 3531 pp.		ASM-SLA Metal Literature Study, 4870 pp.		Syracuse, N.Y., Telephone Directory, 1955, 307 pp.		Av. Freq., %	Rank
Letter	Freq., %	Deviation	Freq., %	Deviation	Freq., %	Deviation	Av. Freq., %	Rank
A	3.8	0.0	2.45	−1.35	5.2	+1.4	3.8	13
B	9.2	−0.2	10.2	+0.8	8.8	−0.6	9.4	2
C	5.7	−0.9	6.2	−0.4	7.8	+1.2	6.6	5
D	5.0	−0.3	5.3	0.0	5.5	+0.2	5.3	6.5
E	2.3	+0.1	2.25	+0.05	2.0	−0.2	2.2	16
F	3.6	0.0	3.4	−0.2	3.9	+0.3	3.6	14
G	5.3	0.0	5.6	+0.3	4.9	−0.4	5.3	6.5
H	6.7	−0.1	7.35	+0.55	6.2	−0.6	6.8	4
I	1.6	+0.7	0.75	−0.15	0.3	−0.6	0.9	21.5
J	1.8	+0.1	1.75	+0.05	1.6	−0.1	1.7	19
K	6.0	+1.2	4.9	−0.3	4.6	−0.6	5.2	8
L	4.6	−0.4	5.65	+0.65	4.6	−0.4	5.0	9
M	7.7	−0.6	8.25	−0.05	8.8	+0.5	8.3	3
N	2.3	+0.3	1.8	−0.2	2.0	0.0	2.0	17
O	1.4	−0.2	1.4	−0.2	2.0	+0.4	1.6	20
P	4.6	−0.1	4.5	−0.2	4.9	+0.2	4.7	11.5
Q	0.1	−0.1	0.1	−0.1	0.3	+0.1	0.2	25
R	5.0	+0.1	4.65	−0.25	4.9	0.0	4.9	10
S	11.3	+0.1	11.0	−0.2	11.4	+0.2	11.2	1
T	3.4	+0.2	3.65	+0.45	2.6	−0.6	3.2	15
U	0.7	+0.2	0.45	−0.05	0.3	−0.2	0.5	23
V	1.8	−0.1	2.15	+0.25	1.6	−0.3	1.9	18
W	4.6	−0.1	4.65	−0.15	4.9	+0.2	4.7	11.5
X	0.0	0.0	0.0	0.0	0.0	0.0	0.0	26
Y	0.5	+0.1	0.5	+0.1	0.3	−0.1	0.4	24
Z	1.0	+0.1	1.1	+0.2	0.7	−0.2	0.9	21.5
Check sum	100.0	+0.1	99.55	−0.4	100.1	−0.2	100.3
^a Frequencies which deviate more than 1% from the average are shown in italics.

TABLE 3 Subject-word letter frequencies (332 words)^a

	First letter		Second letter		Third letter		Fourth letter		Fifth letter
Letter	Freq., %	Rank	Freq., %	Rank	Freq., %	Rank	Freq., %	Rank	Freq., %	Rank
A	9.3	2	17.8	1	8.4	2	7.7	3	5.3	9.5
B	4.8	8	0.6	17	2.7	15	2.2	17	0.0	24.5
C	8.1	3	1.8	12	5.4	9	6.2	5	2.3	11.5
D	3.6	12.5	0.3	21.5	6.3	6.5	5.0	9	2.0	13
E	4.2	11	12.3	2	6.3	6.5	11.8	1	13.3	1
F	4.5	9.5	0.3	21.5	2.4	16	1.2	20	0.7	19
G	4.5	9.5	0.3	21.5	1.8	17	0.9	22	0.7	19
H	3.6	12.5	3.9	9.5	0.9	19	3.7	13.5	1.3	15
I	3.3	14.5	11.1	4	5.2	10	10.8	2	9.3	4
J	0.9	21	0.0	25.5	0.0	25.5	1.2	20	0.0	24.5
K	0.6	23.5	0.3	21.5	0.3	22.5	2.8	15.5	0.3	21.5
L	2.7	16	6.9	7	7.8	5	3.9	11.5	6.7	7
M	6.3	6	0.9	14.5	3.9	12.5	5.3	8	2.3	11.5
N	3.3	14.5	3.9	9.5	5.7	8	5.9	7	7.3	6
O	2.1	17.5	11.4	3	8.1	3.5	6.2	5	11.7	3
P	7.8	4	1.5	13	3.9	12.5	3.7	13.5	1.0	16.5
Q	0.6	23.5	0.3	21.5	0.3	22.5	0.3	24	0.0	24.5
R	6.6	5	7.5	6	12.0	1	3.9	11.5	12.0	2
S	9.9	1	0.6	17	4.5	11	4.3	10	5.7	8
T	6.0	7	2.4	11	8.1	3.5	6.2	5	8.7	5
U	1.8	19	7.8	5	3.3	14	2.8	15.5	5.3	9.5
V	1.5	20	0.3	21.5	0.6	20	1.2	20	0.3	21.5
W	2.1	17.5	0.0	25.5	0.3	22.5	0.3	24	0.7	19
X	0.6	23.5	0.9	14.5	0.3	22.5	0.0	26	1.0	16.5
Y	0.3	26	6.0	8	1.2	18	1.9	18	1.7	14
Z	0.6	23.5	0.6	17	0.0	25.5	0.3	24	0.0	24.5
No. of blanks	0		0		0		9		33
Check sum	99.6		99.7		99.7		99.5		99.6
^a Blanks are not counted in computing percentages.

TABLE 4 Proper-name letter frequencies (309 names)^a

	First letter		Second letter		Third letter		Fourth letter		Fifth letter
Letter	Freq., %	Rank	Freq., %	Rank	Freq., %	Rank	Freq., %	Rank	Freq., %	Rank
A	3.2	14	21.7	1	6.8	5	7.0	5.5	6.5	8.5
B	8.8	3	0.3	19	2.3	14	4.3	10	1.5	16
C	7.8	4	3.9	7	2.9	12	3.6	12.5	2.9	11.5
D	5.5	6	0.6	15.5	2.0	16.5	6.0	8	1.5	16
E	2.0	17	14.3	2	7.1	4	11.6	1	19.2	1
F	4.2	13	0.0	23.5	1.3	20	0.7	23	0.7	20
G	4.9	9	0.6	15.5	3.9	10	2.6	14.5	2.9	11.5
H	6.2	5	3.6	8	2.3	14	2.6	14.5	6.5	8.5
I	0.6	22	10.0	4	4.9	8	8.0	3	6.9	6
J	1.6	19	0.0	23.5	0.3	25	0.0	25.5	0.4	23
K	4.5	11.5	0.0	23.5	1.3	20	3.6	12.5	3.3	10
L	4.5	11.5	3.2	9.5	11.0	3	9.3	2	7.6	3
M	9.4	2	1.6	13	0.6	23.5	1.3	19.5	2.2	14
N	2.0	17	2.3	12	12.0	2	7.6	4	6.9	6
O	2.0	17	16.8	3	6.5	6	4.3	10	10.1	2
P	4.9	9	0.6	15.5	1.6	18	1.7	18	0.7	20
Q	0.3	24.5	0.0	23.5	0.0	26	0.0	25.5	0.0	25
R	4.9	9	6.5	6	12.6	1	6.3	7	7.2	4
S	11.7	1	0.3	19	4.9	8	4.3	10	6.9	6
T	2.9	15	3.2	9.5	4.9	8	7.0	5.5	2.5	13
U	0.6	22	6.8	5	3.6	11	2.3	16.5	0.7	20
V	1.3	20	0.3	19	1.0	22	2.3	16.5	0.0	25
W	5.2	7	0.6	15.5	2.3	14	1.0	21.5	1.5	16
X	0.0	26	0.0	23.5	0.6	23.5	0.3	24	0.0	25
Y	0.3	24.5	2.6	11	2.0	16.5	1.3	19.5	0.7	20
Z	0.6	22	0.0	23.5	1.3	20	1.0	21.5	0.7	20
No. of blanks	0		0		0		6		32
Check sum	99.9		99.8		100.0		100.0		100.0
^a Blanks are not counted in computing percentages.

TABLE 5 Amount of information (H) in subject-word letters^a

	First letter			Second letter			Third letter			Fourth letter			Fifth letter			Av. English text (after Pratt (9))
Rank, n		p_n	−p_n log₂p_n		p_n	−p_n log₂p_n		p_n	−p_n log₂p_n		p_n	−p_n log₂p_n		p_n	−p_n log₂p_n		p_n	−p_n log₂p_n
1	S	.099	.3303	A	.178	.4432	R	.120	.3671	E	.118	.3638	E	.133	.3871	E	.131	.3841
2	A	.093	.3187	E	.123	.3719	A	.084	.3002	I	.108	.3468	R	.120	.3671	T	.105	.3414
3	C	.081	.2937	O	.114	.3571	OT	.081	.2937	A	.077	.2848	O	.117	.3622	A	.082	.2959
4	P	.078	.2871	I	.111	.3520			.2937			.2487	I	.093	.3187	O	.080	.2915
5	R	.066	.2588	U	.078	.2871	L	.078	.2871	C,O,T	.062	.2487	T	.087	.3065	N	.071	.2709
6	M	.063	.2513	R	.075	.2803	D,E	.063	.2513			.2487	N	.073	.2756	R	.068	.2637
7	T	.060	.2435	L	.069	.2661			.2513	N	.059	.2409	L	.067	.2613	I	.063	.2513
8	B	.048	.2103	Y	.060	.2435	N	.057	.2356	M	.053	.2246	S	.057	.2356	S	.061	.2461
9	F,G	.045	.2013	H,N	.039	.1825	C	.054	.2274	D	.050	.2161	A,U	.053	.2246	H	.053	.2246
10			.2013			.1825	I	.052	.2218	S	.043	.1952			.2246	D	.038	.1793
11	E	.042	.1921	T	.024	.1291	S	.045	.2013	L,R	.039	.1825	C,M	.023	.1252	L	.034	.1659
12	D,H	.036	.1727	C	.018	.1043	M,P	.039	.1825			.1825			.1252	F	.029	.1481
13			.1727	P	.015	.0909			.1825			.1760	D	.020	.1129	C	.028	.1444
14	I,N	.033	.1624	M,X	.009	.0612	U	.033	.1624	H,P	.037	.1760	Y	.017	.0999	M,U	.025	.1330
15			.1624			.0612	B	.027	.1407	K,U	.028	.1444	H	.013	.0815			.1330
16	L	.027	.1407			.0443	F	.024	.1291			.1444	P,X	.010	.0664			.1129
17	O,W	.021	.1170	B,S,Z	.006	.0443	G	.018	.1043	B	.022	.1211			.0664	G,Y,P	.020	.1129
18			.1170			.0443	Y	.012	.0766	Y	.019	.1086			.0501			.1129
19	U	.018	.1043			.0251	H	.009	.0612			.0766	F,G,W	.007	.0501	W	.015	.0909
20	V	.015	.0909			.0251	V	.006	.0443	F,J,V	.012	.0766			.0501	B	.014	.0862
21	J	.009	.0612	D,F,G, K,Q,V	.003	.0251			.0251			.0766	K,V	.003	.0251	V	.009	.0612
22			.0443			.0251	K,Q,W, X	.003	.0251	G	.009	.0612			.0251	K	.004	.0319
23			.0443			.0251			.0251			.0251				X	.002	.0179
24	K,Q,X,Z	.006	.0443			.0251			.0251	Q,W,Z	.003	.0251	B,J,Q,Z	0				.0100
25			.0443	J,W 0			J,Z	0				.0251				J,Q,Z	.001	.0100
26	Y	.003	.0251							X	0							.0100
log₂ 26=4.7			4.2920			3.7964			4.1255			4.5201			3.8413			4.1300
R=1−H/(log₂26)			9%			20%			12%			4%			18%			12%
^a Average of five letters, 20.5753/5=4.1151.

TABLE 6 Subject-word cumulative letter frequencies (in rank order)^a

^a On an equiprobable basis, each letter would occur 3.846% of the time.

TABLE 7 Weighted letter frequencies, %^a

Letter	First letter	Third letter	Fourth letter
A	8.5	8.2	7.6
B	5.3	2.7	2.5
C	8.0	5.1	5.9
D	3.8	5.8	5.1
E	3.9	6.4	11.8
F	4.4	2.2	1.1
G	4.6	2.1	1.1
H	3.9	1.1	3.6
I	3.0	5.2	10.5
J	1.0	0.0	1.1
K	1.1	0.4	2.9
L	2.9	8.2	4.6
M	6.7	3.5	4.8
N	3.1	6.5	6.1
O	2.2	7.9	6.0
P	7.4	3.6	3.5
Q	0.6	0.3	0.3
R	6.4	12.1	4.2
S	10.0	5.0	4.3
T	5.6	7.7	6.3
U	1.7	3.3	2.7
V	1.5	0.7	1.3
W	2.5	0.5	0.4
X	0.5	0.3	0.0
Y	0.3	1.3	1.8
Z	0.6	0.2	0.4
Check sum	99.5%	100.3%	99.9%
^a Weighted seven parts subject word to one part proper name.
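The weighting behind Table 7 can be illustrated with a minimal sketch, assuming it is a plain weighted average of the Table 3 and Table 4 percentages. The function name `weighted_frequency` is hypothetical, and only four first-letter values are copied in to keep the example short.

```python
# First-letter frequencies (%) for a few letters, copied from Table 3
# (subject words) and Table 4 (proper names).
SUBJECT_FIRST = {"A": 9.3, "B": 4.8, "M": 6.3, "S": 9.9}
NAME_FIRST = {"A": 3.2, "B": 8.8, "M": 9.4, "S": 11.7}

def weighted_frequency(subject_pct, name_pct, subject_parts=7, name_parts=1):
    """Weighted average of two percentage frequencies (7 parts to 1 by default)."""
    total_parts = subject_parts + name_parts
    return (subject_parts * subject_pct + name_parts * name_pct) / total_parts

weighted = {
    letter: weighted_frequency(SUBJECT_FIRST[letter], NAME_FIRST[letter])
    for letter in SUBJECT_FIRST
}
for letter, pct in weighted.items():
    print(letter, round(pct, 2))
```

The results fall within about 0.1 of the first-letter column of Table 7 (8.5, 5.3, 6.7, and 10.0 for A, B, M, and S); the small differences presumably reflect rounding in the published tables.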

For the initial letters of subject terms, the rank order was S, A, C, P, R, M, T, · · ·; for second letters, A, E, O, I, U, R, L · · ·; for third, R, A, O or T, L, D or E, · · ·; for fourth, E, I, A, T or O or C, N, · · · ; and for fifth, E, R, O, I, T, N, L · · ·, as shown in Table 5. Cumulated frequencies are given in Table 6.

Table 5 also gives the information measure −p_n log₂p_n for each letter in each position (8). For this purpose, percentage frequencies were assumed to represent actual probabilities, p_n. The sum for each letter position,

$$H = -\sum_{n=1}^{26} p_n \log_2 p_n,$$

represents H, the average uncertainty per letter-position or, as it is sometimes called, the average information represented by the letter position, in bits. The redundancy R = 1 − H/log₂ 26 is also shown at the bottom of Table 5 for each letter position.

These calculations show that the least redundant (or the most informative) letter position is the fourth, next to that the first, and then the third. Similar results can be shown for proper names.
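As an illustration of the calculation, the following sketch computes H and R for the first-letter position only, using the percentages of Table 3 as probabilities without renormalizing them (the paper's stated assumption). The function names are hypothetical, and the exact value of log₂ 26 is used where Table 5 rounds it to 4.7.

```python
import math

# First-letter percentage frequencies of subject words, copied from Table 3.
FIRST_LETTER_PCT = {
    "A": 9.3, "B": 4.8, "C": 8.1, "D": 3.6, "E": 4.2, "F": 4.5, "G": 4.5,
    "H": 3.6, "I": 3.3, "J": 0.9, "K": 0.6, "L": 2.7, "M": 6.3, "N": 3.3,
    "O": 2.1, "P": 7.8, "Q": 0.6, "R": 6.6, "S": 9.9, "T": 6.0, "U": 1.8,
    "V": 1.5, "W": 2.1, "X": 0.6, "Y": 0.3, "Z": 0.6,
}

def entropy_bits(pct_by_letter):
    """H = -sum(p * log2(p)), with percentages taken as probabilities."""
    h = 0.0
    for pct in pct_by_letter.values():
        p = pct / 100.0
        if p > 0:  # a letter that never occurs contributes nothing
            h -= p * math.log2(p)
    return h

def redundancy(h, alphabet_size=26):
    """R = 1 - H / log2(alphabet size)."""
    return 1.0 - h / math.log2(alphabet_size)

h_first = entropy_bits(FIRST_LETTER_PCT)
r_first = redundancy(h_first)
print(f"H = {h_first:.3f} bits, R = {r_first:.0%}")
```

For this column the sketch should agree, to rounding, with the 4.2920 bits and 9% listed under "First letter" in Table 5.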

For the marginal-punched card application, first and third letter positions were selected for coding. Subject-word frequencies were weighted with proper names in a 7-to-1 proportion,² as shown in Table 7. The 52 letters of first and third positions were then assigned to the 38 available positions as equally as possible, but under the restriction that alphabetical order along the side of the card be preserved. The result is shown in Fig. 1, and Table 8 shows the predicted frequency distribution for this code. Note that it was necessary sometimes to combine several letters in one position, and sometimes to split one letter between two positions. These splits were chosen according to the median frequencies of English trigrams (9).

²	According to Wise (3), the ratio X/H, or that of the number of positions to be punched to the number of positions available for punching, should be about 0.46. Taking H to be 19, X=8.75. The dropping fraction f_d=(G/H)^Y, or the ratio of the number of positions actually punched to the number available for punching, raised to a power, Y, representing the number of sorting elements used, works out to be (7/19)² =13.7%, if about 9 codes are actually superimposed. Note that

A maximum of 8 coding words was chosen, based on these calculations.

TABLE 8 Comparison of actual and predicted letter frequencies

First letter^a
Letter	Actual no. of cards dropped	Actual %	Predicted %
Aa-An (median between anx and any)	29	2.6	4.25
Ao-Az	19	1.7	4.25
B	54	4.9	5.3
Ca-Ci (median between cka and cke)	65	5.9	4.0
Ck-Cz	103	9.35	4.0
D	46	4.2	3.8
E	37	3.35	3.9
F	36^c	3.3	4.4
G			4.6
H			3.9
I & J			4.0
K & L			4.0
M			6.7
N & O			5.3
P			7.4
Q & R			7.0
Sa-Si (median between siv and six)			5.0
Sj-Sz			5.0
T			5.6
U-Z			7.1

Third letter^b
Letter	Actual no. of cards dropped	Actual %	Predicted %
aa-aq (median between ard and are)	40	3.8	5.45
ar-az & b	80	7.6	5.45
c	60	5.65	5.1
d	55	5.2	5.8
e	100	9.45	6.4
f, g & h	80	7.6	5.4
i, j & k	45	4.25	5.6
la-lo (median between lov and low)	45	4.25	5.85
lp-lz & m	80	7.6	5.85
n	80	7.6	6.5
oa-os (median between otf and oth)	45	4.25	5.9
ot-oz, p, q	35	3.3	5.9
ra-rg (median between rge and rgo)	40	3.8	6.05
rh-rz	65	6.1	6.05
s	65	6.1	5.0
ta-th (median between tid and tie)	40	3.8	3.85
ti-tz	35	3.3	3.85
u-z	70	6.6	6.3
Total	1060
Avg.	59
^a Ideally each first letter position would comprise 5%.
^b Ideally each third letter position would comprise 5.5%.
^c Not carried to completion.
Note: About 400 cards were used in study. Actual number was estimated by measuring cards dropped at 150 cards/inch. Predicted percentage based on Table 7. Dropping fraction, F_d=(G/H)^Y=(2.7/18)¹=15%. For 400 cards, F_d=60. H is the number of coding positions; G is the number of punches/card=1100/400; Y is the number of sorting positions=1. (See Wise (3) for derivation.)

Splitting letters modifies the first-letter position somewhat by the second, and third somewhat by the fourth. Such letter-pair frequencies take account of intersymbol influence, and therefore make possible a better code than single-letter frequencies. H.P.Luhn has designed a superimposed code using randomizing squares (10) which takes advantage of letter-pair frequencies.
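The predicted percentages for the third-letter positions in Table 8 can be reproduced from the third-letter column of Table 7 with a short sketch. It rests on an assumption inferred from the tabulated values rather than stated in the text: letters that share a run of card positions pool their frequencies, and a pool spread over two positions is divided equally between them (the "median" split). The names below are hypothetical.

```python
# Weighted third-letter frequencies (%), copied from Table 7.
THIRD_LETTER_PCT = {
    "A": 8.2, "B": 2.7, "C": 5.1, "D": 5.8, "E": 6.4, "F": 2.2, "G": 2.1,
    "H": 1.1, "I": 5.2, "J": 0.0, "K": 0.4, "L": 8.2, "M": 3.5, "N": 6.5,
    "O": 7.9, "P": 3.6, "Q": 0.3, "R": 12.1, "S": 5.0, "T": 7.7, "U": 3.3,
    "V": 0.7, "W": 0.5, "X": 0.3, "Y": 1.3, "Z": 0.2,
}

# (letters pooled together, number of card positions the pool occupies),
# in alphabetical order as in the third-letter half of Table 8.
THIRD_LETTER_BLOCKS = [
    ("AB", 2), ("C", 1), ("D", 1), ("E", 1), ("FGH", 1), ("IJK", 1),
    ("LM", 2), ("N", 1), ("OPQ", 2), ("R", 2), ("S", 1), ("T", 2),
    ("UVWXYZ", 1),
]

def predicted_position_pct(freq_pct, blocks):
    """Predicted share (%) of each card position, one entry per position."""
    predicted = []
    for letters, n_positions in blocks:
        pool = sum(freq_pct[ch] for ch in letters)
        # Assumed: a pool is divided equally among its positions.
        predicted.extend([pool / n_positions] * n_positions)
    return predicted

predicted = predicted_position_pct(THIRD_LETTER_PCT, THIRD_LETTER_BLOCKS)
ideal = 100.0 / len(predicted)
print(len(predicted), round(ideal, 2), [round(p, 2) for p in predicted])
```

The list has one entry per third-letter position, and the spread of the entries around `ideal` shows how far the alphabetical-order restriction keeps the code from being exactly equifrequent.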

An empirical check of the letter code shown in Fig. 1 was made on a 400-card file maintained by the author. The results are shown in Table 8³. The average dropping fraction for the third position alone compares well with the dropping fraction as calculated by formula, but the range (from 9 to 25%) is broader than hoped for. However, Table 8 shows that the agreement between actual and predicted frequencies in the third-letter position was very good, considering the alphabetic-order limitation imposed in assigning the positions.
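The comparison in this paragraph can be sketched numerically. The formula is the one quoted in the note to Table 8, F_d = (G/H)^Y, and the card counts are the third-letter "Actual no. of cards dropped" column of the same table. The function name is hypothetical, and the file size of 400 cards is the approximate figure given in the note.

```python
def dropping_fraction(punches_per_card, coding_positions, sorting_positions=1):
    """F_d = (G / H) ** Y, following the note to Table 8."""
    return (punches_per_card / coding_positions) ** sorting_positions

FILE_SIZE = 400  # "about 400 cards" per the note to Table 8

# Third-letter positions: G = 1100/400 punches per card, H = 18, Y = 1.
predicted_fraction = dropping_fraction(1100 / FILE_SIZE, 18, 1)
predicted_cards = predicted_fraction * FILE_SIZE

# Cards actually dropped at each of the 18 third-letter positions (Table 8).
observed_cards = [40, 80, 60, 55, 100, 80, 45, 45, 80, 80,
                  45, 35, 40, 65, 65, 40, 35, 70]
observed_fractions = [n / FILE_SIZE for n in observed_cards]
mean_fraction = sum(observed_fractions) / len(observed_fractions)

print(f"predicted {predicted_fraction:.1%} (about {predicted_cards:.0f} cards)")
print(f"observed mean {mean_fraction:.1%}, "
      f"range {min(observed_fractions):.1%} to {max(observed_fractions):.1%}")
```

The observed mean lies close to the predicted fraction, while the extremes correspond to the range of roughly 9 to 25% mentioned above.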

By using data-processing equipment, much more elaborate studies on much larger samples would be possible. The author is working with such equipment and hopes to have some results available in the near future.

Equifrequency-letter codes have many other applications, including the preassignment of space in files and indexes, in cryptography, and in philology. For example, the data in Table 5 can provide a quantitative measure of subject word popularity. Taking a few words from the Library of Congress list of subject headings, we add the percentage frequencies of each letter (up to 5) together and divide by the number of letters. (Multiply each p_n by 100 to get the percentage frequency.)

AIRCRAFT has a value of (9.3+11.1+12.0+6.2+12.0)/5 = 10.12

DIVIDER has a value of (3.6+11.1+0.6+10.8+2.0)/5 = 5.62

ICHTHYOLOGY has a value of (3.3+1.8+0.9+6.2+1.3)/5 = 2.70

These three words give some idea of the range possible in a subject-heading list. In general dictionary words, the highest found was SARI, with a value of 12.63, and the lowest, ONYX, with a value of 1.8. It is interesting to compare these values with the highest possible letter combination (not necessarily an English word), which is SAREE (value 12.96), and the lowest (value 0.06). The highest is very nearly realized in actuality, while the lowest never comes close. Also note that the word SARI is certainly uncommon English; this phenomenon may occur because the intersymbol connections are broken by taking single-letter frequencies.
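The popularity measure is simple enough to state as a short sketch: average the positional percentage frequencies of a word's first five letters, or of all its letters if it has fewer. The frequencies below are copied from Table 3; the function name `popularity` and the handling of non-alphabetic characters (they are skipped) are assumptions made for illustration.

```python
# Subject-word frequencies (%) for letter positions 1-5, copied from Table 3.
POSITION_PCT = {
    "A": (9.3, 17.8, 8.4, 7.7, 5.3),   "B": (4.8, 0.6, 2.7, 2.2, 0.0),
    "C": (8.1, 1.8, 5.4, 6.2, 2.3),    "D": (3.6, 0.3, 6.3, 5.0, 2.0),
    "E": (4.2, 12.3, 6.3, 11.8, 13.3), "F": (4.5, 0.3, 2.4, 1.2, 0.7),
    "G": (4.5, 0.3, 1.8, 0.9, 0.7),    "H": (3.6, 3.9, 0.9, 3.7, 1.3),
    "I": (3.3, 11.1, 5.2, 10.8, 9.3),  "J": (0.9, 0.0, 0.0, 1.2, 0.0),
    "K": (0.6, 0.3, 0.3, 2.8, 0.3),    "L": (2.7, 6.9, 7.8, 3.9, 6.7),
    "M": (6.3, 0.9, 3.9, 5.3, 2.3),    "N": (3.3, 3.9, 5.7, 5.9, 7.3),
    "O": (2.1, 11.4, 8.1, 6.2, 11.7),  "P": (7.8, 1.5, 3.9, 3.7, 1.0),
    "Q": (0.6, 0.3, 0.3, 0.3, 0.0),    "R": (6.6, 7.5, 12.0, 3.9, 12.0),
    "S": (9.9, 0.6, 4.5, 4.3, 5.7),    "T": (6.0, 2.4, 8.1, 6.2, 8.7),
    "U": (1.8, 7.8, 3.3, 2.8, 5.3),    "V": (1.5, 0.3, 0.6, 1.2, 0.3),
    "W": (2.1, 0.0, 0.3, 0.3, 0.7),    "X": (0.6, 0.9, 0.3, 0.0, 1.0),
    "Y": (0.3, 6.0, 1.2, 1.9, 1.7),    "Z": (0.6, 0.6, 0.0, 0.3, 0.0),
}

def popularity(word, table=POSITION_PCT, max_letters=5):
    """Mean positional frequency (%) of the word's first `max_letters` letters."""
    letters = [ch for ch in word.upper() if ch in table][:max_letters]
    if not letters:
        raise ValueError("word contains no letters A-Z")
    return sum(table[ch][i] for i, ch in enumerate(letters)) / len(letters)

for word in ("AIRCRAFT", "ICHTHYOLOGY", "SARI", "ONYX", "SAREE"):
    print(word, round(popularity(word), 2))
```

Because each letter is scored independently of its neighbours, the measure ignores intersymbol influence, which is the limitation noted above for SARI.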

³	Since the first-letter positions showed quite wide deviations from the predicted frequencies, their analysis was never completed. It is now thought that third and fourth letter positions would have made a more invariant code, less subject to the fluctuations which occur in any particular file because of the selection of particular terms.

ACKNOWLEDGMENT

The work described in this paper was performed while the author was in the employ of Carrier Corporation, Syracuse, New York.

REFERENCES

1. C.E.SHANNON, A Mathematical Theory of Communication, Bell System Technical Journal, July 1948, and following.

2. C.N.MOOERS, Zatocoding and Development in Information Retrieval, ASLIB Proc., February 1956, p. 3 (Many other papers by this author may be obtained from his Zator Co., 79 Milk Street, Boston, Massachusetts.)

3. C.S.WISE, A Punched-Card File Based on Word Coding, pp. 93–114, in Perry and Casey’s Punched Cards, Reinhold Publishing Corporation, New York, 1951.

4. MOOERS and WISE had discussions in American Documentation, April 1950, October 1950, and October 1952.

5. H.OHLMAN, The Low-Cost Production of Marginal-Punched Cards on Accounting Machines, pp. 123–26 American Documentation, April 1957.

6. JOINT COMMITTEE OF ASM AND SLA, ASM-SLA Metallurgical Literature Classification, American Society for Metals, 1950. (Figure 5, which was based on an analysis of 4870 names by A.H.Geisler in ASM Review of Metallurgical Literature.)

7. K.A.KRIEGER, A Punched-Card System for Chemical Literature, J. of Chemical Education, March 1949, p. 163.

8. E.T.KLEMMER, Tables for Computing Informational Measures, p. 75 in Quastler’s Information Theory in Psychology, Free Press, Glencoe, Ill., 1955.

9. F.PRATT, Secret and Urgent, The Story of Codes and Ciphers, Blue Ribbon Books, Garden City, N.Y., 1942, pp. 264–78.

10. H.P.LUHN, Superimposed Coding With the Aid of Randomizing Squares for Use in Mechanical Information Searching Systems, IBM Product Development Lab., Poughkeepsie, New York, 1956.
